Re: [zfs-discuss] Help! System panic when pool imported

2009-09-25 Thread Albert Chin
On Fri, Sep 25, 2009 at 05:21:23AM +, Albert Chin wrote:
 [[ snip snip ]]
 
 We really need to import this pool. Is there a way around this? We do
 have snv_114 source on the system if we need to make changes to
 usr/src/uts/common/fs/zfs/dsl_dataset.c. It seems like the zfs
 destroy transaction never completed and it is being replayed, causing
 the panic. This cycle continues endlessly.

What are the implications of adding the following to /etc/system:
  set zfs:zfs_recover=1
  set aok=1

And importing the pool with:
  # zpool import -o ro
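Separately, would it be worth poking at the pool with zdb before trying the import again? My understanding (please correct me) is that zdb can examine an exported/unimported pool from userspace, so it shouldn't trip the same panic path:

  # zdb -e <poolname>        (walk the pool metadata without importing it)
  # zdb -e -d <poolname>     (list the datasets zdb can see)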

-- 
albert chin (ch...@thewrittenword.com)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Persistent errors - do I believe?

2009-09-25 Thread Chris Murray
Cheers, I did try that, but still got the same total on import - 2.73TB

I even thought I might have just made a mistake with the numbers, so I made a 
sort of 'quarter scale model' in VMware and OSOL 2009.06, with 3x250G and 
1x187G. That gave me a size of 744GB, which is *approx* 1/4 of what I get in 
the physical machine. That makes sense. I then replaced the 187 with another 
250, still 744GB total, as expected. Exported & imported - now 996GB. So, the 
export and import process seems to be the thing to do, but why it's not working 
on my physical machine (SXCE119) is a mystery. I even contemplated that there 
might have still been a 750GB drive left in the setup, but they're all 1TB 
(well, 931.51GB).

Any ideas what else it could be?

For anyone interested in the checksum/permanent error thing, I'm running a 
scrub now. 59% done and not one error.
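
One more thing I plan to try: I believe builds from around snv_117 onwards (so SXCE119 should qualify) grew an 'autoexpand' pool property that defaults to off and gates this kind of growth, which would also explain why 2009.06 in VMware expands on export/import while the physical box doesn't. Roughly (the pool name 'tank' and the device name below are just examples):

   zpool get autoexpand tank
   zpool set autoexpand=on tank
   zpool online -e tank c7t0d0      (repeat for each replaced disk)
   zpool list tank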
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Casper . Dik

On Fri, 25 Sep 2009, James Lever wrote:

 NFS Version 3 introduces the concept of "safe asynchronous writes".

Being safe then requires a responsibility level on the client which 
is often not present.  For example, if the server crashes, and then 
the client crashes, how does the client resend the uncommitted data? 
If the client had a non-volatile storage cache, then it would be able 
to responsibly finish the writes that failed.

If the client crashes, it is clear that work will be lost back to the point
of the client's last successful commit.  The client's responsibility goes no
further than supporting the NFSv3 commit operation and resending the
operations that were not yet committed.  If the client crashes, we know that
non-committed operations may be dropped on the floor.

The commentary says that normally the COMMIT operations occur during 
close(2) or fsync(2) system call, or when encountering memory 
pressure.  If the problem is slow copying of many small files, this 
COMMIT approach does not help very much since very little data is sent 
per file and most time is spent creating directories and files.

Indeed; the commit is mostly to make sure that the pipe between the server
and the client can be filled for write operations.
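
If you want to see how often clients actually issue COMMITs, something like this on the server should do it (assuming your build has the nfsv3 DTrace provider; the probe and argument names are from memory, so verify them first):

   # dtrace -n 'nfsv3:::op-commit-start { @[args[0]->ci_remote] = count(); }'

That counts COMMIT operations per client address until interrupted.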

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Nathan
I am about to embark on building a home NAS box using OpenSolaris with 
ZFS.

Currently I have a chassis that will hold 16 hard drives, although not in 
caddies. Downtime doesn't bother me if I need to switch a drive; I could 
probably do it with the system running anyway, it's just a bit of a pain. :)

I am after suggestions for motherboard, CPU and RAM.  Basically I want ECC RAM 
and at least two PCI-E x4 slots, as I want to run 2 x AOC-USAS-L8i cards 
for 16 drives.

I want something with a bit of guts but not over the top.  I know the HCL is there 
but I want to see what other people are using in their solutions.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)

2009-09-25 Thread Tim Foster

On Fri, 2009-09-25 at 01:32 -0700, Erik Trimble wrote:
 Go back and look through the archives for this list. We just had this 
 discussion last month. Let's not rehash it again, as it seems to get 
 redone way too often.

You know, this seems like such a common question to the list, would we
(the zfs community) be interested in coming up with a rolling set of
'recommended' systems that home users could use as a reference, rather
than requiring people to trawl through the archives each time?


Perhaps a few tiers, with as many user-submitted systems per-tier as we
get.

 * small  boot disk + 2 or 3 disks, low power, quiet, small media server
 * medium boot disk + 3 - 9 disks, home office, larger media server
 * large  boot disk + 9 or more disks, thumper-esque

and keep them up to date as new hardware becomes available, with a bit
of space on a website somewhere to manage them.

These could either be off-the-shelf dedicated NAS systems, or
build-to-order machines, but getting their configuration & 
last-known-price would be useful.

I don't have enough experience myself in terms of knowing what's the
best hardware on the market, but from time to time, I do think about
upgrading my system at home, and would really appreciate a
zfs-community-recommended configuration to use.

Any takers?

cheers,
tim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)

2009-09-25 Thread Eugen Leitl
On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote:

 I don't have enough experience myself in terms of knowing what's the
 best hardware on the market, but from time to time, I do think about
 upgrading my system at home, and would really appreciate a
 zfs-community-recommended configuration to use.
 
 Any takers?

I'm willing to contribute (zfs on Opensolaris, mostly Supermicro
boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a 
wiki for that somewhere?

-- 
Eugen* Leitl http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Strange issue with ZFS and du (snv_118 and snv_123)

2009-09-25 Thread Christiano Blasi
Hi Guys,

maybe someone has some time to take a look at my issue; I didn't find an answer 
using the search.

Here we go:

I was running a backup of a directory located on a ZFS pool named TimeMachine. 
Before I started the job, I checked the size of the directory called NFS, and 
du -h or du -s was telling me 25GB for /TimeMachine/NFS. So I started the job, and 
after a while I was very surprised that the backup app requested a new tape, 
and then a new one again, so in total 3 tapes (LTO1) for 25GB ??!?!

After that I checked  the directory again:

r...@fileserver:/TimeMachine/NFS# du -h .
25G   
r...@fileserver:/TimeMachine/NFS# du -s
25861519

r...@fileserver:/TimeMachine/NFS# ls -lh
total 25G
-rw-r--r-- 1 root root 232G 2009-09-25 14:04 nfs.tar

 zfs list TimeMachine/NFS
NAME  USED  AVAIL  REFER  MOUNTPOINT
TimeMachine/NFS  24.7G   818G  24.7G  /TimeMachine/NFS

Also, if I use Nautilus under Gnome, it also tells me that the directory NFS uses 
232GB and not 24.7GB as du and zfs list report to me ?!?! Same if I mount 
that share (AFP) from a Mac or via NFS: still 232GB used for 
TimeMachine/NFS.

r...@fileserver:/Data/nfs_org# ls -lh
total 232G
-rw-r--r-- 1 root root 232G 2009-09-24 17:57 nfs.tar
r...@fileserver:/Data/nfs_org# du -h .
232G.
r...@fileserver:/Data/nfs_org# 

I've upgraded from snv_118 to snv_123, but it's still the same. I also copied the 
content of the directory to another ZFS pool, removed the original content and copied 
it back again, but I still get an incorrect value!
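
One check I can still run to narrow this down is comparing the logical file size against the blocks actually allocated. If the allocation is far smaller than the logical size, the copy under /TimeMachine/NFS is presumably sparse (holes are not stored), which would also explain why the backup app, which reads the full 232GB logical size, needed three LTO1 tapes:

r...@fileserver:/TimeMachine/NFS# ls -lh nfs.tar      (logical size)
r...@fileserver:/TimeMachine/NFS# du -h nfs.tar       (space allocated by ZFS)
r...@fileserver:/TimeMachine/NFS# ls -s nfs.tar       (allocated blocks)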

 pool: TimeMachine
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
TimeMachine  ONLINE   0 0 0
  raidz1ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
c5t0d0  ONLINE   0 0 0
c6t0d0  ONLINE   0 0 0
c5t1d0  ONLINE   0 0 0
c6t1d0  ONLINE   0 0 0

errors: No known data errors

r...@fileserver:/TimeMachine/NFS# zfs get all TimeMachine
NAME PROPERTY  VALUE  SOURCE
TimeMachine  type  filesystem -
TimeMachine  creation  Sat Feb 28 17:48 2009  -
TimeMachine  used  2.76T  -
TimeMachine  available 818G   -
TimeMachine  referenced1.24T  -
TimeMachine  compressratio 1.00x  -
TimeMachine  mounted   yes-
TimeMachine  quota none   default
TimeMachine  reservation   none   default
TimeMachine  recordsize128K   default
TimeMachine  mountpoint/TimeMachine   default
TimeMachine  sharenfs  offdefault
TimeMachine  checksum  on default
TimeMachine  compression   offdefault
TimeMachine  atime on default
TimeMachine  devices   on default
TimeMachine  exec  on default
TimeMachine  setuidon default
TimeMachine  readonly  offdefault
TimeMachine  zoned offdefault
TimeMachine  snapdir   hidden default
TimeMachine  aclmode   groupmask  default
TimeMachine  aclinheritrestricted default
TimeMachine  canmount  on default
TimeMachine  shareiscsioffdefault
TimeMachine  xattr on default
TimeMachine  copies1  default
TimeMachine  version   3  -
TimeMachine  utf8only  off-
TimeMachine  normalization none   -
TimeMachine  casesensitivity   sensitive  -
TimeMachine  vscan offdefault
TimeMachine  nbmandoffdefault
TimeMachine  sharesmb  offdefault
TimeMachine  refquota  none   default
TimeMachine  refreservationnone   default
TimeMachine  primarycache  alldefault
TimeMachine  secondarycachealldefault
TimeMachine  usedbysnapshots   0  -
TimeMachine  usedbydataset 1.24T  -
TimeMachine  usedbychildren1.53T  -
TimeMachine  usedbyrefreservation  0  -
TimeMachine  logbias   latencydefault
r...@fileserver:/TimeMachine/NFS# 


r...@fileserver:/TimeMachine/NFS# zfs list TimeMachine
NAME  USED  AVAIL  REFER  

Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Jason King
It does seem to come up regularly... perhaps someone with access could
throw up a page under the ZFS community with the conclusions (and
periodic updates as appropriate).

On Fri, Sep 25, 2009 at 3:32 AM, Erik Trimble erik.trim...@sun.com wrote:
 Nathan wrote:

 While I am about to embark on building a home NAS box using OpenSolaris
 with ZFS.

 Currently I have a chassis that will hold 16 hard drives, although not in
 caddies - down time doesn't bother me if I need to switch a drive, probably
 could do it running anyways just a bit of a pain. :)

 I am after suggestions of motherboard, CPU and ram.  Basically I want ECC
 ram and at least two PCI-E x4 channels.  As I want to run 2 x AOC-USAS_L8i
 cards for 16 drives.

 I want something with a bit of guts but not over the top.  I know the HCL is
 there but I want to see what other people are using in their solutions.


 Go back and look through the archives for this list. We just had this
 discussion last month. Let's not rehash it again, as it seems to get redone
 way too often.



 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NLM_DENIED_NOLOCKS Solaris 10u5 X4500

2009-09-25 Thread Chris Banal
This was previously posted to the sun-managers mailing list, but the only
reply I received recommended I post here as well.

We have a production Solaris 10u5 / ZFS X4500 file server which is
reporting NLM_DENIED_NOLOCKS immediately for any NFS locking request. The
lockd does not appear to be busy, so is it possible we have hit some sort of
limit on the number of files that can be locked? Are there any items to
check before restarting lockd / statd? Restarting them appears to have at
least temporarily cleared up the issue.
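
For reference, this is roughly how we bounced the lock services via SMF (standard Solaris 10 service names; adjust if yours differ):

# svcs -l svc:/network/nfs/nlockmgr:default
# svcadm restart svc:/network/nfs/status:default
# svcadm restart svc:/network/nfs/nlockmgr:default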

Thanks,
Chris
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Collecting hardware configurations (was Re: White box server for OpenSolaris)

2009-09-25 Thread Cindy Swearingen

The opensolaris.org site will be transitioning to a wiki-based site
soon, as described here:

http://www.opensolaris.org/os/about/faq/site-transition-faq/

I think it would be best to use the new site to collect this
information because it will be much easier for community members
to contribute.

I'll provide a heads up when the transition, which has been delayed,
is complete.

Cindy

On 09/25/09 03:31, Eugen Leitl wrote:

On Fri, Sep 25, 2009 at 10:18:15AM +0100, Tim Foster wrote:


I don't have enough experience myself in terms of knowing what's the
best hardware on the market, but from time to time, I do think about
upgrading my system at home, and would really appreciate a
zfs-community-recommended configuration to use.

Any takers?


I'm willing to contribute (zfs on Opensolaris, mostly Supermicro
boxes and FreeNAS (FreeBSD 7.2, next 8.x probably)). Is there a 
wiki for that somewhere?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Travis Tabbal
 I am after suggestions of motherboard, CPU and ram.
 Basically I want ECC ram and at least two PCI-E x4
 channels.  As I want to run 2 x AOC-USAS_L8i cards
  for 16 drives.

Asus M4N82 Deluxe. I have one running with 2 USAS-L8i cards just fine. I don't 
have all the drives loaded in yet, but the cards are detected and they can use 
the drives I do have attached. I currently have 8GB of ECC RAM on the board and 
it's working fine. The ECC options in the BIOS are enabled and it reports the 
ECC is enabled at boot. It has 3 PCIe x16 slots, I have a graphics card in the 
other slot, and an Intel e1000g card in the PCIe x1 slot. The onboard 
peripherals all work, with the exception of the onboard AHCI ports being buggy 
in b123 under xVM. Not sure what that's all about; I posted in the main 
discussion board but haven't heard whether it's a known bug or if it will be fixed 
in the next version. It would be nice as my boot drives are on that controller. 
2009.06 works fine though. CPU is a Phenom II X3 720. Probably overkill for 
fileserver duties, but I also want to do some VMs for other things, thus the 
bug I found with the xVM updates.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Thu, Sep 24, 2009 at 11:29 PM, James Lever j...@jamver.id.au wrote:

 On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:

 The commentary says that normally the COMMIT operations occur during
 close(2) or fsync(2) system call, or when encountering memory pressure.  If
 the problem is slow copying of many small files, this COMMIT approach does
 not help very much since very little data is sent per file and most time is
 spent creating directories and files.

 The problem appears to be slog bandwidth exhaustion due to all data being
 sent via the slog creating a contention for all following NFS or locally
 synchronous writes.  The NFS writes do not appear to be synchronous in
 nature - there is only a COMMIT being issued at the very end, however, all
 of that data appears to be going via the slog and it appears to be inflating
 to twice its original size.
 For a test, I just copied a relatively small file (8.4MB in size).  Looking
 at a tcpdump analysis using wireshark, there is a SETATTR which ends with a
 V3 COMMIT and no COMMIT messages during the transfer.
 iostat output that matches looks like this:
 slog write of the data (17MB appears to hit the slog)
[snip]
 then a few seconds later, the transaction group gets flushed to primary
 storage writing nearly 11.4MB which is inline with raid Z2 (expect around
 10.5MB; 8.4/8*10):
[snip]
 So I performed the same test with a much larger file (533MB) to see what it
 would do, being larger than the NVRAM cache in front of the SSD.  Note that
 after the second second of activity the NVRAM is full and only allowing in
 about the sequential write speed of the SSD (~70MB/s).
[snip]
 Again, the slog wrote about double the file size (1022.6MB) and a few
 seconds later, the data was pushed to the primary storage (684.9MB with an
 expectation of 666MB = 533MB/8*10) so again about the right number hit the
 spinning platters.
[snip]
 Can anybody explain what is going on with the slog device in that all data
 is being shunted via it and why about double the data size is being written
 to it per transaction?

By any chance do you have copies=2 set?

That would turn each write into two.

Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw
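
To make that persist across reboots, the same tunable can (I believe) also be set in /etc/system; the value is in bytes, so this matches the 256MB used above:

set zfs:zfs_write_limit_override=268435456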

Set the PERC flush interval to say 1 second.

As an aside, an slog device will not be too beneficial for large
sequential writes, because they will be throughput bound, not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput than an SSD. An example of a workload that benefits
from an slog device is ESX over NFS, which does a COMMIT for each
block written, so it benefits from an slog, but a standard media
server will not (though an L2ARC would be beneficial there).

Better workload analysis is really what it is about.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Ross Walker wrote:


As a side an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much


Surely this depends on the origin of the large sequential writes.  If 
the origin is NFS and the SSD has considerably more sustained write 
bandwidth than the ethernet transfer bandwidth, then using the SSD is 
a win.  If the SSD accepts data slower than the ethernet can deliver 
it (which seems to be this particular case) then the SSD is not 
helping.


If the ethernet can pass 100MB/second, then the sustained write 
specification for the SSD needs to be at least 100MB/second.  Since 
data is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it 
to ZFS, the SSD should support write bursts of at least double that or 
else it will not be helping bulk-write performance.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] selecting zfs BE from OBP

2009-09-25 Thread Donour Sizemore
Can you select the LU boot environment from sparc obp, if the  
filesystem is zfs? With ufs, you simply invoke 'boot [slice]'.


thanks

donour
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] selecting zfs BE from OBP

2009-09-25 Thread Cindy Swearingen

Hi Donour,

You would use the boot -L syntax to select the ZFS BE to boot from,
like this:

ok boot -L

Rebooting with command: boot -L
Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/d...@w2104cf7fa6c7,0:a 
 File and args: -L

1 zfs1009BE
2 zfs10092BE
Select environment to boot: [ 1 - 2 ]: 2

Then copy and paste the boot string that is provided:

To boot the selected entry, invoke:
boot [root-device] -Z rpool/ROOT/zfs10092BE

Program terminated
{0} ok boot -Z rpool/ROOT/zfs10092BE

See this pointer as well:

http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view

Cindy


On 09/25/09 11:09, Donour Sizemore wrote:
Can you select the LU boot environment from sparc obp, if the filesystem 
is zfs? With ufs, you simply invoke 'boot [slice]'.


thanks

donour
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Richard Elling

On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:


On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:

On Fri, 25 Sep 2009, Ross Walker wrote:


As a side an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much


Surely this depends on the origin of the large sequential writes.   
If the
origin is NFS and the SSD has considerably more sustained write  
bandwidth
than the ethernet transfer bandwidth, then using the SSD is a win.   
If the
SSD accepts data slower than the ethernet can deliver it (which  
seems to be

this particular case) then the SSD is not helping.

If the ethernet can pass 100MB/second, then the sustained write
specification for the SSD needs to be at least 100MB/second.  Since  
data is
buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to  
ZFS, the
SSD should support write bursts of at least double that or else it  
will not

be helping bulk-write performance.


Specifically I was talking about NFS as that was what the OP was talking
about, and yes, it does depend on the origin. But you also assume that
NFS I/O goes over only a single 1GbE interface when it could be over
multiple 1GbE interfaces, a 10GbE interface, or even multiple 10GbE
interfaces. You also assume the I/O recorded in the ZIL is just the raw
I/O, when there is also metadata or multiple transaction copies as
well.

Personally I still prefer to spread the ZIL across the pool and have
a large NVRAM-backed HBA, as opposed to an slog which really puts all
my I/O in one basket. If I had a pure NVRAM device I might consider
using that as an slog device, but SSDs are too variable for my taste.


Back of the envelope math says:
10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
or so.

Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]
 -- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS flar image.

2009-09-25 Thread Peter Pickford
Hi Peter,

Do you have any notes on what you did to restore a sendfile to an existing BE?

I'm interested in creating a 'golden image' and restoring this into a
new BE on a running system as part of a hardening project.

Thanks

Peter

2009/9/14 Peter Karlsson peter.karls...@sun.com:
 Hi Greg,

 We did a hack along those lines when we installed 100 Ultra 27s that were used
 during J1, but we automated the process by using AI to install a bootstrap
 image that had an SMF service that pulled over the zfs sendfile, created a new
 BE and received the sendfile into the new BE. It worked fairly OK; there were a
 few things that we had to run a few scripts to fix, but by and large it was
 smooth. I really need to get that blog entry done :)

 /peter

 Greg Mason wrote:

 As an alternative, I've been taking a snapshot of rpool on the golden
 system, sending it to a file, and creating a boot environment from the
 archived snapshot on target systems. After fiddling with the snapshots a
 little, I then either appropriately anonymize the system or provide it with
 its identity. When it boots up, it's ready to go.

 The only downfall to my method is that I still have to run the full
 OpenSolaris installer, and I can't exclude anything in the archive.

 Essentially, it's a poor man's flash archive.

 -Greg

 cindy.swearin...@sun.com wrote:

 Hi RB,

 We have a draft of the ZFS/flar image support here:

 http://opensolaris.org/os/community/zfs/boot/flash/

 Make sure you review the Solaris OS requirements.

 Thanks,

 Cindy

 On 09/14/09 11:45, RB wrote:

 Is it possible to create flar image of ZFS root filesystem to install it
 to other macines?

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cloning Systems using zpool

2009-09-25 Thread Peter Pickford
Hi Lori,

Is the u8 flash support for the whole root pool or an individual BE
using live upgrade?

Thanks

Peter

2009/9/24 Lori Alt lori@sun.com:
 On 09/24/09 15:54, Peter Pickford wrote:

 Hi Cindy,

 Wouldn't

 touch /reconfigure
 mv /etc/path_to_inst* /var/tmp/

 regenerate all device information?


 It might, but it's hard to say whether that would accomplish everything
 needed to move a root file system from one system to another.

 I just got done modifying flash archive support to work with zfs root on
 Solaris 10 Update 8.  For those not familiar with it, flash archives are a
 way to clone full boot environments across multiple machines.  The S10
 Solaris installer knows how to install one of these flash archives on a
 system and then do all the customizations to adapt it to the  local hardware
 and local network environment.  I'm pretty sure there's more to the
 customization than just a device reconfiguration.

 So feel free to hack together your own solution.  It might work for you, but
 don't assume that you've come up with a completely general way to clone root
 pools.

 lori

 AFAIK zfs doesn't care about the device names; it scans for them.
 It would only affect things like vfstab.

 I did a restore from an E2900 to a V890 and it seemed to work.

 Created the pool and did a zfs receive.

 I would like to be able to have a zfs send of a minimal build and
 install it in an ABE and activate it.
 I tried that in test and it seems to work.

 It seems to work but I'm just wondering what I may have missed.

 I saw someone else has done this on the list and was going to write a blog.

 It seems like a good way to get a minimal install on a server with
 reduced downtime.

 Now if I just knew how to run the installer in an ABE without there
 being an OS there already, that would be cool too.

 Thanks

 Peter

 2009/9/24 Cindy Swearingen cindy.swearin...@sun.com:


 Hi Peter,

 I can't provide it because I don't know what it is.

 Even if we could provide a list of items, tweaking
  the device information if the systems are not identical
 would be too difficult.

 cs

 On 09/24/09 12:04, Peter Pickford wrote:


 Hi Cindy,

 Could you provide a list of system specific info stored in the root pool?

 Thanks

 Peter

 2009/9/24 Cindy Swearingen cindy.swearin...@sun.com:


 Hi Karl,

 Manually cloning the root pool is difficult. We have a root pool recovery
 procedure that you might be able to apply as long as the
 systems are identical. I would not attempt this with LiveUpgrade
 and manually tweaking.


 http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery

 The problem is that the amount system-specific info stored in the root
 pool and any kind of device differences might be insurmountable.

 Solaris 10 ZFS/flash archive support is available with patches but not
 for the Nevada release.

 The ZFS team is working on a split-mirrored-pool feature and that might
 be an option for future root pool cloning.

 If you're still interested in a manual process, see the steps below
 attempted by another community member who moved his root pool to a
 larger disk on the same system.

 This is probably more than you wanted to know...

 Cindy



 # zpool create -f altrpool c1t1d0s0
 # zpool set listsnapshots=on rpool
 # SNAPNAME=`date +%Y%m%d`
 # zfs snapshot -r rpool/r...@$snapname
 # zfs list -t snapshot
 # zfs send -R rp...@$snapname | zfs recv -vFd altrpool
 # installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk
 /dev/rdsk/c1t1d0s0
 for x86 do
 # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
 Set the bootfs property on the root pool BE.
 # zpool set bootfs=altrpool/ROOT/zfsBE altrpool
 # zpool export altrpool
 # init 5
 remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0
 -insert solaris10 dvd
 ok boot cdrom -s
 # zpool import altrpool rpool
 # init 0
 ok boot disk1

 On 09/24/09 10:06, Karl Rossing wrote:


 I would like to clone the configuration on a v210 with snv_115.

 The current pool looks like this:

  -bash-3.2$ /usr/sbin/zpool status
    pool: rpool
  state: ONLINE
  scrub: none requested
 config:

       NAME          STATE     READ WRITE CKSUM
       rpool         ONLINE       0     0     0
         mirror      ONLINE       0     0     0
           c1t0d0s0  ONLINE       0     0     0
           c1t1d0s0  ONLINE       0     0     0

 errors: No known data errors

 After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to
 /tmp/a so that I can make the changes I need prior to removing the drive
 and
 putting it into the new v210.

  I suppose I could lucreate -n new_v210, lumount new_v210, edit what I
 need
 to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0
 and
 then luactivate the original boot environment.


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



 

[zfs-discuss] Best way to convert checksums

2009-09-25 Thread Ray Clark
What is the Best way to convert the checksums of an existing ZFS file system 
from one checksum to another?  To me Best means safest and most complete.

My zpool is 39% used, so there is plenty of space available.

Thanks.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cloning Systems using zpool

2009-09-25 Thread Lori Alt


The whole pool.  Although you can choose to exclude individual datasets 
from the flar when creating it.



lori


On 09/25/09 12:03, Peter Pickford wrote:

Hi Lori,

Is the u8 flash support for the whole root pool or an individual BE
using live upgrade?

Thanks

Peter

2009/9/24 Lori Alt lori@sun.com:
  

On 09/24/09 15:54, Peter Pickford wrote:

Hi Cindy,

Wouldn't

touch /reconfigure
mv /etc/path_to_inst* /var/tmp/

regenerate all device information?


It might, but it's hard to say whether that would accomplish everything
needed to move a root file system from one system to another.

I just got done modifying flash archive support to work with zfs root on
Solaris 10 Update 8.  For those not familiar with it, flash archives are a
way to clone full boot environments across multiple machines.  The S10
Solaris installer knows how to install one of these flash archives on a
system and then do all the customizations to adapt it to the  local hardware
and local network environment.  I'm pretty sure there's more to the
customization than just a device reconfiguration.

So feel free to hack together your own solution.  It might work for you, but
don't assume that you've come up with a completely general way to clone root
pools.

lori

AFAIK zfs doesn't care about the device names; it scans for them.
It would only affect things like vfstab.

I did a restore from an E2900 to a V890 and it seemed to work.

Created the pool and did a zfs receive.

I would like to be able to have a zfs send of a minimal build and
install it in an ABE and activate it.
I tried that in test and it seems to work.

It seems to work but I'm just wondering what I may have missed.

I saw someone else has done this on the list and was going to write a blog.

It seems like a good way to get a minimal install on a server with
reduced downtime.

Now if I just knew how to run the installer in an ABE without there
being an OS there already, that would be cool too.

Thanks

Peter

2009/9/24 Cindy Swearingen cindy.swearin...@sun.com:


Hi Peter,

I can't provide it because I don't know what it is.

Even if we could provide a list of items, tweaking
the device information if the systems are not identical
would be too difficult.

cs

On 09/24/09 12:04, Peter Pickford wrote:


Hi Cindy,

Could you provide a list of system specific info stored in the root pool?

Thanks

Peter

2009/9/24 Cindy Swearingen cindy.swearin...@sun.com:


Hi Karl,

Manually cloning the root pool is difficult. We have a root pool recovery
procedure that you might be able to apply as long as the
systems are identical. I would not attempt this with LiveUpgrade
and manually tweaking.


http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery

The problem is that the amount system-specific info stored in the root
pool and any kind of device differences might be insurmountable.

Solaris 10 ZFS/flash archive support is available with patches but not
for the Nevada release.

The ZFS team is working on a split-mirrored-pool feature and that might
be an option for future root pool cloning.

If you're still interested in a manual process, see the steps below
attempted by another community member who moved his root pool to a
larger disk on the same system.

This is probably more than you wanted to know...

Cindy



# zpool create -f altrpool c1t1d0s0
# zpool set listsnapshots=on rpool
# SNAPNAME=`date +%Y%m%d`
# zfs snapshot -r rpool/r...@$snapname
# zfs list -t snapshot
# zfs send -R rp...@$snapname | zfs recv -vFd altrpool
# installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk
/dev/rdsk/c1t1d0s0
for x86 do
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
Set the bootfs property on the root pool BE.
# zpool set bootfs=altrpool/ROOT/zfsBE altrpool
# zpool export altrpool
# init 5
remove source disk (c1t0d0s0) and move target disk (c1t1d0s0) to slot0
-insert solaris10 dvd
ok boot cdrom -s
# zpool import altrpool rpool
# init 0
ok boot disk1

On 09/24/09 10:06, Karl Rossing wrote:


I would like to clone the configuration on a v210 with snv_115.

The current pool looks like this:

-bash-3.2$ /usr/sbin/zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

  NAME  STATE READ WRITE CKSUM
  rpool ONLINE   0 0 0
mirror  ONLINE   0 0 0
  c1t0d0s0  ONLINE   0 0 0
  c1t1d0s0  ONLINE   0 0 0

errors: No known data errors

After I run zpool detach rpool c1t1d0s0, how can I remount c1t1d0s0 to
/tmp/a so that I can make the changes I need prior to removing the drive
and
putting it into the new v210.

I suppose I could lucreate -n new_v210, lumount new_v210, edit what I
need
to, luumount new_v210, luactivate new_v210, zpool detach rpool c1t1d0s0
and
then luactivate the original boot environment.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Frank Middleton

On 09/25/09 11:08 AM, Travis Tabbal wrote:

... haven't heard if it's a known
bug or if it will be fixed in the next version...


Out of courtesy to our host, Sun makes some quite competitive
X86 hardware. I have absolutely no idea how difficult it is
to buy Sun machines retail, but it seems they might be missing
out on an interesting market - robust and scalable SOHO servers
for the DIY gang - certainly OEMs like us recommend them,
although there doesn't seem to be a single-box file+application
server in the lineup which might be a disadvantage to some.

Also, assuming Oracle keeps the product line going, we plan to
give them a serious look when we finally have to replace those
sturdy old SPARCs. Unfortunately there aren't entry level SPARCs
in the lineup, but sadly there probably isn't a big enough market
to justify them and small developers don't need the big iron.

It would be interesting to hear from Sun if they have any specific
recommendations for the use of Suns for the DIY SOHO market; AFAIK
it is the profits from hardware that are going a long way to support
Sun's support of FOSS that we are all benefiting from, and there's
a good bet that OpenSolaris will run well on Sun hardware :-)

Cheers -- Frank
 



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New to ZFS: One LUN, multiple zones

2009-09-25 Thread Mārcis Lielturks
2009/9/24 Robert Milkowski mi...@task.gda.pl

 Mike Gerdts wrote:

 On Wed, Sep 23, 2009 at 7:32 AM, bertram fukuda bertram.fuk...@hp.com
 wrote:


 Thanks for the info Mike.

 Just so I'm clear.  You suggest 1)create a single zpool from my LUN 2)
 create a single ZFS filesystem 3) create 2 zone in the ZFS filesystem. Sound
 right?



 Correct




 Well I would actually recommend creating a dedicated zfs file system for
 each zone (which zoneadm should do for you anyway). The reason is that it is
 much easier to get information on how much storage each zone is using;
 you can set a quota or reservation for storage for each zone independently,
 and you can easily clone each zone, snapshot it, etc.


Another thing: if you will use Live Upgrade (and as I understand it, pkg
image-update does that seamlessly), then besides putting each zone on its
own filesystem you should also add another two datasets delegated to the
zones where they can store their data. This would ensure that during LU you
don't boot up with slightly old data in the zones. For example, this could be
very important on mail servers, so you don't lose new mail in spool
directories which arrived after the new environment was created but before
the reboot.
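
Delegating a dataset is just a zonecfg entry; roughly like this (the zone and dataset names here are only examples):

  # zfs create tank/zones/web01-data
  # zonecfg -z web01
  zonecfg:web01> add dataset
  zonecfg:web01:dataset> set name=tank/zones/web01-data
  zonecfg:web01:dataset> end
  zonecfg:web01> commit
  zonecfg:web01> exit

After the zone is rebooted, the dataset is visible inside it and can be managed there with the normal zfs commands.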



 --
 Robert Milkowski
 http://milek.blogspot.com


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread David Abrahams

Hi,

Since I don't even have a mirror for my root pool rpool, I'd like to
move as much of my system as possible over to my raidz2 pool, tank.
Can someone tell me which parts need to stay in rpool in order for the
system to work normally?

Thanks.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best way to convert checksums

2009-09-25 Thread Ray Clark
I didn't want my question to lead toward a particular answer, but perhaps I should 
have given more information.  My idea is to copy the file system with one of the following:
   cp -rp
   zfs send | zfs receive
   tar
   cpio
But I don't know what would be the best.

Then I would do a diff -r on them before deleting the old.

I don't know about the (for me) obscure secondary things like attributes, links, 
extended modes, etc.
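
In case it helps frame the question, the zfs send route I have in mind looks roughly like this (dataset names are placeholders; my understanding, which I'd like confirmed, is that receive rewrites every block, so the copy picks up whatever checksum setting the destination inherits):

   zfs set checksum=sha256 tank                  (so the new dataset inherits the new checksum)
   zfs snapshot tank/data@convert
   zfs send tank/data@convert | zfs receive tank/data_new
   (diff -r the two mountpoints, then)
   zfs destroy -r tank/data
   zfs rename tank/data_new tank/data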

Thanks again.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded

2009-09-25 Thread Robert Milkowski

Chris Kirby wrote:

On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote:


Hi,

I have a zfs send command failing for some reason...


# uname -a
SunOS  5.11 snv_123 i86pc i386 i86pc Solaris

# zfs send -R -I 
archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50 
archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59 
/dev/null
cannot hold 
'archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50': 
pool must be upgraded
cannot hold 
'archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59': 
pool must be upgraded
cannot hold 
'archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14': 
pool must be upgraded
cannot hold 
'archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59': 
pool must be upgraded



# zfs list -r -t all archive-1/archive/
NAME
USED  AVAIL  REFER  MOUNTPOINT
archive-1/archive/   
65.6G  7.69T  8.69G  /archive-1/archive/
archive-1/archive/x...@rsync-2009-04-21_14:52--2009-04-21_15:13  
11.9G  -  12.0G  -
archive-1/archive/x...@rsync-2009-05-01_07:45--2009-05-01_08:06  
12.0G  -  12.1G  -
archive-1/archive/x...@rsync-2009-06-01_07:45--2009-06-01_08:50  
12.2G  -  12.3G  -
archive-1/archive/x...@rsync-2009-07-01_07:45--2009-07-01_07:59  
8.26G  -  8.37G  -
archive-1/archive/x...@rsync-2009-08-01_07:45--2009-08-01_10:14  
12.6G  -  12.7G  -
archive-1/archive/x...@rsync-2009-09-01_07:45--2009-09-01_07:59  
0  -  8.69G  -



The pool is at version 14 and all file systems are at version 3.


Ahhh... if -R is provided, zfs send now calls zfs_hold_range(), which 
later fails in dsl_dataset_user_hold_check() because it checks that the 
pool is not below SPA_VERSION_USERREFS, which is defined as SPA_VERSION_18; 
in my case it is 14, so it fails.


But I don't really want to upgrade to version 18, as then I won't be 
able to reboot back to snv_111b (which supports up to version 14 
only). I guess if I used libzfs from an older build it would work, 
as keeping a user hold is not really required...


I can understand why it was introduced; I'm just unhappy that I can't 
do zfs send -R -I now without upgrading the pool.


Probably no point sending the email, as I was looking at the code and 
dtracing while writing it, but since I've written it I will post it. 
Maybe someone will find it useful.


Robert,
  That's useful information indeed.  I've filed this CR:

6885860 zfs send shouldn't require support for snapshot holds

Sorry for the trouble, please look for this to be fixed soon.

Thank you.
btw: how do you want to fix it? Do you want to acquire a snapshot hold 
but continue anyway if it is not possible (only in case the error is 
ENOTSUP, I think)? Or do you want to get rid of it entirely?



--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded

2009-09-25 Thread Chris Kirby

On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote:


Chris Kirby wrote:

On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote:

 That's useful information indeed.  I've filed this CR:

6885860 zfs send shouldn't require support for snapshot holds

Sorry for the trouble, please look for this to be fixed soon.

Thank you.
btw: how do you want to fix it? Do you want to acquire a snapshot  
hold but continue anyway if it is not possible (only in case the  
error is ENOTSUP, I think)? Or do you want to get rid of it entirely?



In this particular case, we should make sure the pool version supports  
snapshot

holds before trying to request (or release) any.

We still want to acquire the temporary holds if we can, since that
prevents a race with zfs destroy.  That case is becoming more common
with automated snapshots and their associated retention policies.

-Chris

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Cindy Swearingen

Hi David,

All system-related components should remain in the root pool, such as
the components needed for booting and running the OS.

If you have datasets like /export/home or other non-system-related
datasets in the root pool, then feel free to move them out.

Moving OS components out of the root pool is not tested by us and I've
heard of one example recently of breakage when usr and var were moved
to a non-root RAIDZ pool.

It would be cheaper and easier to buy another disk to mirror your root
pool than it would be to take the time to figure out what could move out
and then possibly deal with an unbootable system.

Buy another disk and we'll all sleep better.

Cindy

On 09/25/09 13:35, David Abrahams wrote:

Hi,

Since I don't even have a mirror for my root pool rpool, I'd like to
move as much of my system as possible over to my raidz2 pool, tank.
Can someone tell me which parts need to stay in rpool in order for the
system to work normally?

Thanks.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread David Abrahams

on Fri Sep 25 2009, Cindy Swearingen Cindy.Swearingen-AT-Sun.COM wrote:

 Hi David,

 All system-related components should remain in the root pool, such as
 the components needed for booting and running the OS.

Yes, of course.  But which *are* those?

 If you have datasets like /export/home or other non-system-related
 datasets in the root pool, then feel free to move them out.

Well, for example, surely /opt can be moved?

 Moving OS components out of the root pool is not tested by us and I've
 heard of one example recently of breakage when usr and var were moved
 to a non-root RAIDZ pool.

 It would be cheaper and easier to buy another disk to mirror your root
 pool then it would be to take the time to figure out what could move out
 and then possibly deal with an unbootable system.

 Buy another disk and we'll all sleep better.

Easy for you to say.  There's no room left in the machine for another disk.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] extremely slow writes (with good reads)

2009-09-25 Thread Paul Archer
Since I got my zfs pool working under solaris (I talked on this list 
last week about moving it from linux & bsd to solaris, and the pain that 
was), I'm seeing very good reads, but nada for writes.


Reads:

r...@shebop:/data/dvds# rsync -aP young_frankenstein.iso /tmp
sending incremental file list
young_frankenstein.iso
^C1032421376  20%   86.23MB/s0:00:44

Writes:

r...@shebop:/data/dvds# rsync -aP /tmp/young_frankenstein.iso yf.iso
sending incremental file list
young_frankenstein.iso
^C  68976640   6%2.50MB/s0:06:42


This is pretty typical of what I'm seeing.


r...@shebop:/data/dvds# zpool status -v
  pool: datapool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
datapoolONLINE   0 0 0
  raidz1ONLINE   0 0 0
c2d0s0  ONLINE   0 0 0
c3d0s0  ONLINE   0 0 0
c4d0s0  ONLINE   0 0 0
c6d0s0  ONLINE   0 0 0
c5d0s0  ONLINE   0 0 0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
syspool ONLINE   0 0 0
  c0d1s0ONLINE   0 0 0

errors: No known data errors

(This is while running an rsync from a remote machine to a ZFS filesystem)
r...@shebop:/data/dvds# iostat -xn 5
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   11.1    4.8  395.8  275.9  5.8  0.1  364.7    4.3   2   5 c0d1
9.8   10.9  514.3  346.4  6.8  1.4  329.7   66.7  68  70 c5d0
9.8   10.9  516.6  346.4  6.7  1.4  323.1   66.2  67  70 c6d0
9.7   10.9  491.3  346.3  6.7  1.4  324.7   67.2  67  70 c3d0
9.8   10.9  519.9  346.3  6.8  1.4  326.7   67.2  68  71 c4d0
9.8   11.0  493.5  346.6  3.6  0.8  175.3   37.9  38  41 c2d0
0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.00.00.00.0  0.0  0.00.00.0   0   0 c0d1
   64.6   12.6 8207.4  382.1 32.8  2.0  424.7   25.9 100 100 c5d0
   62.2   12.2 7203.2  370.1 27.9  2.0  375.1   26.7  99 100 c6d0
   53.2   11.8 5973.9  390.2 25.9  2.0  398.8   30.5  98  99 c3d0
   49.4   10.6 5398.2  389.8 30.2  2.0  503.7   33.3  99 100 c4d0
   45.2   12.8 5431.4  337.0 14.3  1.0  247.3   17.9  52  52 c2d0
0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0


Any ideas?

Paul
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Lori Alt

On 09/25/09 13:35, David Abrahams wrote:

Hi,

Since I don't even have a mirror for my root pool rpool, I'd like to
move as much of my system as possible over to my raidz2 pool, tank.
Can someone tell me which parts need to stay in rpool in order for the
system to work normally?

Thanks.

  

The list of datasets in a root pool should look something like this:

rpool
rpool/ROOT   
rpool/ROOT/snv_124  (or whatever version you're running)
rpool/ROOT/snv_124/var   (you might not have this) 
rpool/ROOT/snv_121  (or whatever other BEs you still have)   
rpool/dump   
rpool/export 
rpool/export/home
rpool/swap 

plus any other datasets you might have added.  Datasets you've added in 
addition to the above (unless they are zone roots under 
rpool/ROOT/be-name ) can be moved to another pool.  Anything you have 
in /export or /export/ home can be moved to another pool.  Everything 
else needs to stay in the root pool.  Yes, there are contents of the 
above datasets that could be moved and  your system would still run 
(you'd have to play with mount points or symlinks to get them included 
in the Solaris name space), but such a configuration would be 
non-standard, unsupported, and probably not upgradeable.


lori


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] extremely slow writes (with good reads)

2009-09-25 Thread Paul Archer
Oh, for the record, the drives are 1.5TB SATA, in a 4+1 raidz-1 config. 
All the drives are on the same LSI 150-6 PCI controller card, and the M/B 
is a generic something or other with a triple-core, and 2GB RAM.


Paul


3:34pm, Paul Archer wrote:

Since I got my zfs pool working under solaris (I talked on this list last 
week about moving it from linux & bsd to solaris, and the pain that was), I'm 
seeing very good reads, but nada for writes.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Lori Alt

I have no idea why that last mail lost its line feeds.   Trying again:




On 09/25/09 13:35, David Abrahams wrote:

Hi,

Since I don't even have a mirror for my root pool rpool, I'd like to
move as much of my system as possible over to my raidz2 pool, tank.
Can someone tell me which parts need to stay in rpool in order for the
system to work normally?

Thanks.

  

The list of datasets in a root pool should look something like this:


rpool   
rpool/ROOT
rpool/ROOT/snv_124  (or whatever version you're running)

rpool/ROOT/snv_124/var   (you might not have this)
rpool/ROOT/snv_121  (or whatever other BEs you still have)
rpool/dump  
rpool/export
rpool/export/home   
rpool/swap



plus any other datasets you might have added.  Datasets you've added in 
addition to the above (unless they are zone roots under 
rpool/ROOT/be-name ) can be moved to another pool.  Anything you have 
in /export or /export/ home can be moved to another pool.  Everything 
else needs to stay in the root pool.  Yes, there are contents of the 
above datasets that could be moved and  your system would still run 
(you'd have to play with mount points or symlinks to get them included 
in the Solaris name space), but such a configuration would be 
non-standard, unsupported, and probably not upgradeable.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Glenn Lagasse
* David Abrahams (d...@boostpro.com) wrote:
 
 on Fri Sep 25 2009, Cindy Swearingen Cindy.Swearingen-AT-Sun.COM wrote:
 
  Hi David,
 
  All system-related components should remain in the root pool, such as
  the components needed for booting and running the OS.
 
 Yes, of course.  But which *are* those?
 
  If you have datasets like /export/home or other non-system-related
  datasets in the root pool, then feel free to move them out.
 
 Well, for example, surely /opt can be moved?

Don't be so sure.

  Moving OS components out of the root pool is not tested by us and I've
  heard of one example recently of breakage when usr and var were moved
  to a non-root RAIDZ pool.
 
  It would be cheaper and easier to buy another disk to mirror your root
  pool then it would be to take the time to figure out what could move out
  and then possibly deal with an unbootable system.
 
  Buy another disk and we'll all sleep better.
 
 Easy for you to say.  There's no room left in the machine for another disk.

The question you're asking can't easily be answered.  Sun doesn't test
configs like that.  If you really want to do this, you'll pretty much
have to 'try it and see what breaks'.  And you get to keep both pieces
if anything breaks.

There's very little you can safely move in my experience.  /export
certainly.  Anything else, not really (though ymmv).  I tried to create
a seperate zfs dataset for /usr/local.  That worked some of the time,
but it also screwed up my system a time or two during
image-updates/package installs.

On my 2010.02/123 system I see:

bin Symlink to /usr/bin
boot/
dev/
devices/
etc/
export/ Safe to move, not tied to the 'root' system
kernel/
lib/
media/
mnt/
net/
opt/
platform/
proc/
rmdisk/
root/   Could probably move root's homedir
rpool/
sbin/
system/
tmp/
usr/
var/

Other than /export, everything else is considered 'part of the root
system'.  Thus part of the root pool.

Really, if you can't add a mirror for your root pool, then make backups
of your root pool (left as an exercise to the reader) and store the
non-system-specific bits (/export) on your raidz2 pool.
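
If it helps, the mechanics of moving /export are straightforward; a rough sketch (verify the receive options on your build, and mind the mountpoint juggling so the old and new datasets don't fight over /export):

zfs snapshot -r rpool/export@migrate
zfs send -R rpool/export@migrate | zfs receive -d -u tank    (-u: receive without mounting)
zfs set mountpoint=none rpool/export         (frees up /export)
zfs set mountpoint=/export tank/export
zfs destroy -r rpool/export                  (only after verifying the copy)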

Cheers,

-- 
Glenn
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Toby Thain


On 25-Sep-09, at 2:58 PM, Frank Middleton wrote:


On 09/25/09 11:08 AM, Travis Tabbal wrote:

... haven't heard if it's a known
bug or if it will be fixed in the next version...


Out of courtesy to our host, Sun makes some quite competitive
X86 hardware. I have absolutely no idea how difficult it is
to buy Sun machines retail,


Not very difficult. And there is try and buy.

People overestimate the cost of Sun, and underestimate the real value  
of fully integrated.


--Toby


but it seems they might be missing
out on an interesting market - robust and scalable SOHO servers
for the DIY gang ...

Cheers -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ARC vs Oracle cache

2009-09-25 Thread Christo Kutrovsky
Hi,

Definitely large SGA, small ARC. In fact, it's best to disable the ARC 
altogether for the Oracle filesystems.

Blocks in the db_cache (oracle cache) can be used as is while cached data 
from ARC needs significant CPU processing before it's inserted back into the 
db_cache.

Not to mention that blocks in the db_cache can remain dirty for longer periods, 
saving disk writes.

But definitely:
- separate redo disk (preferably a dedicated disk/pool)
- your ZFS recordsize needs to match the Oracle block size (8 KB by default)
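
As a rough sketch of the ZFS side (pool, dataset and device names are only
placeholders, and primarycache is the closest knob to "disabling the ARC"
for file data on builds that have the property):

  # zfs create -o recordsize=8k -o primarycache=metadata dbpool/oradata
  # zpool create redopool c5t0d0      (dedicated disk/pool for redo)
  # zfs create redopool/redo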

With your configuration, and assuming nothing but the Oracle database server 
on the system, a db_cache size in the 70 GiB range would be perfectly 
acceptable.

Don't forget to set pga_aggregate_target to something reasonable too, like 20 
GiB.

Christo Kutrovsky
Senior DBA
The Pythian Group
I Blog at: www.pythian.com/news
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Peter Pickford
Hi David,

I believe /opt is an essential file system, as it contains software
that is maintained by the packaging system.
In fact, anything you install via pkgadd should probably live in
the BE under /rpool/ROOT/bename.

AFAIK it should not even be split from root in the BE under zfs boot
(only /var is supported), otherwise LU breaks.

I have subdirectories of /opt, like /opt/app, which do not contain
software installed via pkgadd.

I also split off /var/core and /var/crash.

Unfortunately, when you need to boot -F and import the pool for
maintenance, it doesn't mount /var, causing the /var/core and
/var/crash directories to be created in the root file system.

The system then reboots, but when you do a lucreate or lumount it
fails, because /var/core and /var/crash existing on the / file system
causes the mount of /var to fail in the ABE.

I have found it a bit problematic to split off file systems from /
under zfs boot and still have LU work properly.

I haven't tried putting split-off file systems (as opposed to
application file systems) on a different pool, but I believe there may
be mount-ordering issues with mounting dependent file systems from
different pools where the parent file systems are not part of the BE or
legacy mounts.

It is not possible to mount a vxfs file system under a non-legacy zone
root file system due to ordering issues with mounting on boot (legacy
is done before automatic zfs mounts).

Perhaps u7 addressed some of these issues, as I believe it is now
allowable to have a zone root file system on a non-root pool.

These are just my experiences and I'm sure others can give more
definitive answers.
Perhaps it's easier to get some bigger disks.

Thanks

Peter

2009/9/25 David Abrahams d...@boostpro.com:

 on Fri Sep 25 2009, Cindy Swearingen Cindy.Swearingen-AT-Sun.COM wrote:

 Hi David,

 All system-related components should remain in the root pool, such as
 the components needed for booting and running the OS.

 Yes, of course.  But which *are* those?

 If you have datasets like /export/home or other non-system-related
 datasets in the root pool, then feel free to move them out.

 Well, for example, surely /opt can be moved?

 Moving OS components out of the root pool is not tested by us and I've
 heard of one example recently of breakage when usr and var were moved
 to a non-root RAIDZ pool.

 It would be cheaper and easier to buy another disk to mirror your root
 pool then it would be to take the time to figure out what could move out
 and then possibly deal with an unbootable system.

 Buy another disk and we'll all sleep better.

 Easy for you to say.  There's no room left in the machine for another disk.

 --
 Dave Abrahams
 BoostPro Computing
 http://www.boostpro.com

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread David Magda

On Sep 25, 2009, at 16:39, Glenn Lagasse wrote:


There's very little you can safely move in my experience.  /export
certainly.  Anything else, not really (though ymmv).  I tried to  
create

a seperate zfs dataset for /usr/local.  That worked some of the time,
but it also screwed up my system a time or two during
image-updates/package installs.


I'd be very surprised (disappointed?) if /usr/local couldn't be  
detached from the rpool. Given that in many cases it's an NFS mount,  
I'm curious to know why it would need to be part of the rpool. If it  
is a 'dependency' I would consider that a bug.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread James Lever


On 26/09/2009, at 1:14 AM, Ross Walker wrote:


By any chance do you have copies=2 set?


No, only 1.  So the double data going to the slog (as reported by  
iostat) is still confusing me, and is potentially causing significant  
harm to my performance.



Also, try setting zfs_write_limit_override equal to the size of the
NVRAM cache (or half depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw


That’s an interesting concept.  All data still appears to go via the  
slog device; however, under heavy load my response time for a new write is  
typically below 2s (a few outliers at about 3.5s) and a read  
(directory listing of a non-cached entry) is about 2s.


What will this do once it hits the limit?  Will streaming writes now  
be sent directly to a txg and streamed to the primary storage  
devices?  (that is what I would like to see happen).



As a side an slog device will not be too beneficial for large
sequential writes, because it will be throughput bound not latency
bound. slog devices really help when you have lots of small sync
writes. A RAIDZ2 with the ZIL spread across it will provide much
higher throughput then an SSD. An example of a workload that benefits
from an slog device is ESX over NFS, which does a COMMIT for each
block written, so it benefits from an slog, but a standard media
server will not (but an L2ARC would be beneficial).

Better workload analysis is really what it is about.



It seems that it doesn’t matter what the workload is if the NFS pipe  
can sustain more continuous throughput than the slog chain can support.


I suppose some creative use of the logbias setting might assist this  
situation and force all potentially heavy writers directly to the  
primary storage.  This would, however, negate any benefit for having a  
fast, low latency device for those filesystems for the times when it  
is desirable (any large batch of small writes, for example).


Is there a way to have a dynamic, automatic logbias-type setting that  
depends on the transaction currently presented to the server, such that  
a clearly large streaming write gets treated as logbias=throughput and  
a small transaction gets treated as logbias=latency?  (i.e. so that NFS  
transactions can effectively be treated as if they were local storage,  
at the cost of only minimally breaking the benefits of txg scheduling).
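
For reference, the static version of this already exists per dataset on
builds that have the logbias property; the dataset names below are just
examples:

  # zfs set logbias=throughput tank/backups   (bulk writers bypass the slog)
  # zfs set logbias=latency tank/home         (small sync writes keep using it)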


On 26/09/2009, at 3:39 AM, Richard Elling wrote:


Back of the envelope math says:
10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
1 GByte/sec * 30 sec = 30 GBytes

Ross' idea has merit, if the size of the NVRAM in the array is 30  
GBytes

or so.


At this point, enter the fusionIO cards or similar devices.   
Unfortunately there does not seem to be anything on the market with  
infinitely fast write capacity (memory speeds) that is also supported  
under OpenSolaris as a slog device.


I think this is precisely what I (and anybody running a general  
purpose NFS server) need for a general purpose slog device.



Both of the above assume there is lots of memory in the server.
This is increasingly becoming easier to do as the memory costs
come down and you can physically fit 512 GBytes in a 4u server.
By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
NVRAM in their arrays. So Bob's recommendation of reducing the
txg commit interval below 30 seconds also has merit.  Or, to put it
another way, the dynamic sizing of the txg commit interval isn't
quite perfect yet. [Cue for Neil to chime in... :-)]


How does reducing the txg commit interval really help?  Will data no  
longer go via the slog once it is streaming to disk?  Or will data  
still all be pushed through the slog regardless?


For a predominantly NFS server purpose, it really looks like a case of  
the slog has to outperform your main pool for continuous write speed  
as well as an instant response time as the primary criterion. Which  
might as well be a fast (or group of fast) SSDs or 15kRPM drives with  
some NVRAM in front of them.


Is there also a way to throttle synchronous writes to the slog  
device?  Much like the ZFS write throttling that is already  
implemented, so that there is a gap for new writers to enter when  
writing to the slog device? (or is this the norm and includes slog  
writes?)


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 5:24 PM, James Lever j...@jamver.id.au wrote:

 On 26/09/2009, at 1:14 AM, Ross Walker wrote:

 By any chance do you have copies=2 set?

 No, only 1.  So the double data going to the slog (as reported by iostat) is
 still confusing me and clearly potentially causing significant harm to my
 performance.

Weird then, I thought that would be an easy explanation.

 Also, try setting zfs_write_limit_override equal to the size of the
 NVRAM cache (or half depending on how long it takes to flush):

 echo zfs_write_limit_override/W0t268435456 | mdb -kw

 That’s an interesting concept.  All data still appears to go via the slog
 device, however, under heavy load my responsive to a new write is typically
 below 2s (a few outliers at about 3.5s) and a read (directory listing of a
 non-cached entry) is about 2s.

 What will this do once it hits the limit?  Will streaming writes now be sent
 directly to a txg and streamed to the primary storage devices?  (that is
 what I would like to see happen).

It sets the max size of a txg to the given value. When a txg hits that
number it flushes to disk.
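
If that helps, it can presumably be made persistent across reboots with an
/etc/system entry (the value here is just the 256MB example from above):

  set zfs:zfs_write_limit_override=0x10000000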

 As a side an slog device will not be too beneficial for large
 sequential writes, because it will be throughput bound not latency
 bound. slog devices really help when you have lots of small sync
 writes. A RAIDZ2 with the ZIL spread across it will provide much
 higher throughput then an SSD. An example of a workload that benefits
 from an slog device is ESX over NFS, which does a COMMIT for each
 block written, so it benefits from an slog, but a standard media
 server will not (but an L2ARC would be beneficial).

 Better workload analysis is really what it is about.


 It seems that it doesn’t matter what the workload is if the NFS pipe can
 sustain more continuous throughput the slog chain can support.

Only for large sequential writes; small sync IO should still benefit from the slog.

 I suppose some creative use of the logbias setting might assist this
 situation and force all potentially heavy writers directly to the primary
 storage.  This would, however, negate any benefit for having a fast, low
 latency device for those filesystems for the times when it is desirable (any
 large batch of small writes, for example).

 Is there a way to have a dynamic, auto logbias type setting depending on the
 transaction currently presented to the server such that if it is clearly a
 large streaming write it gets treated as logbias=throughput and if it is a
 small transaction it gets treated as logbias=latency?  (i.e. such that NFS
 transactions can be effectively treated as if it was local storage but
 minorly breaking the benefits of the txg scheduling).

I'll leave that to the Sun guys to answer.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Enrico Maria Crisostomo
On Fri, Sep 25, 2009 at 10:56 PM, Toby Thain t...@telegraphics.com.au wrote:

 On 25-Sep-09, at 2:58 PM, Frank Middleton wrote:

 On 09/25/09 11:08 AM, Travis Tabbal wrote:

 ... haven't heard if it's a known
 bug or if it will be fixed in the next version...

 Out of courtesy to our host, Sun makes some quite competitive
 X86 hardware. I have absolutely no idea how difficult it is
 to buy Sun machines retail,

 Not very difficult. And there is try and buy.
Indeed, at least in Spain and in Italy I had no problem buying
workstations. I have recently owned both a Sun Ultra 20 M2 and an Ultra 24.
I was very happy with them, and the price seemed very competitive to me
compared to offers from other mainstream hardware providers.


 People overestimate the cost of Sun, and underestimate the real value of
 fully integrated.
+1. People like full integration when it comes, for example, to
Apple, iPods and iPhones. When it comes, to take another example,
to Solaris, ZFS, ECC memory and so forth (do you remember
those posts some time ago?), they quickly forget.


 --Toby

 but it seems they might be missing
 out on an interesting market - robust and scalable SOHO servers
 for the DIY gang ...

 Cheers -- Frank






-- 
Ελευθερία ή θάνατος
Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs, and the Universe trying
to produce bigger and better idiots. So far, the Universe is winning.
GPG key: 1024D/FD2229AF
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:

 On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:

 On Fri, 25 Sep 2009, Ross Walker wrote:

 As a side an slog device will not be too beneficial for large
 sequential writes, because it will be throughput bound not latency
 bound. slog devices really help when you have lots of small sync
 writes. A RAIDZ2 with the ZIL spread across it will provide much

 Surely this depends on the origin of the large sequential writes.  If the
 origin is NFS and the SSD has considerably more sustained write bandwidth
 than the ethernet transfer bandwidth, then using the SSD is a win.  If
 the SSD accepts data slower than the ethernet can deliver it (which seems to
 be this particular case) then the SSD is not helping.

 If the ethernet can pass 100MB/second, then the sustained write
 specification for the SSD needs to be at least 100MB/second.  Since data
 is buffered in the Ethernet,TCP/IP,NFS stack prior to sending it to ZFS, the
 SSD should support write bursts of at least double that or else it will
 not be helping bulk-write performance.

 Specifically I was talking about NFS, as that was what the OP was talking
 about, but yes, it does depend on the origin. You also assume that
 NFS IO goes over only a single 1Gbe interface when it could be over
 multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe
 interfaces. You also assume the IO recorded in the ZIL is just the raw
 IO, when there is also meta-data or multiple transaction copies as
 well.

 Personally I still prefer to spread the ZIL across the pool and have
 a large NVRAM-backed HBA, as opposed to an slog which really puts all
 my IO in one basket. If I had a pure NVRAM device I might consider
 using that as an slog device, but SSDs are too variable for my taste.

 Back of the envelope math says:
        10 Gbe = ~1 GByte/sec of I/O capacity

 If the SSD can only sink 70 MByte/s, then you will need:
        int(1000/70) + 1 = 15 SSDs for the slog

 For capacity, you need:
        1 GByte/sec * 30 sec = 30 GBytes

Where did the 30 seconds come in here?

The amount of time to hold cache depends on how fast you can fill it.

 Ross' idea has merit, if the size of the NVRAM in the array is 30 GBytes
 or so.

I'm thinking you can do less if you don't need to hold it for 30 seconds.

 Both of the above assume there is lots of memory in the server.
 This is increasingly becoming easier to do as the memory costs
 come down and you can physically fit 512 GBytes in a 4u server.
 By default, the txg commit will occur when 1/8 of memory is used
 for writes. For 30 GBytes, that would mean a main memory of only
 240 Gbytes... feasible for modern servers.

 However, most folks won't stomach 15 SSDs for slog or 30 GBytes of
 NVRAM in their arrays. So Bob's recommendation of reducing the
 txg commit interval below 30 seconds also has merit.  Or, to put it
 another way, the dynamic sizing of the txg commit interval isn't
 quite perfect yet. [Cue for Neil to chime in... :-)]

I'm sorry did I miss something Bob said about the txg commit interval?

I looked back and didn't see it, maybe it was off-list?

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Marion Hakanson
j...@jamver.id.au said:
 For a predominantly NFS server purpose, it really looks like a case of the
 slog has to outperform your main pool for continuous write speed as well as
 an instant response time as the primary criterion. Which might as well be a
 fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of
 them. 

I wonder if you ran Richard Elling's zilstat while running your
workload.  That should tell you how much ZIL bandwidth is needed,
and it would be interesting to see if its stats match with your
other measurements of slog-device traffic.
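
Something like the following, assuming the zilstat.ksh script from
Richard's site (the exact option syntax may differ between versions):

  # ./zilstat.ksh 10 6      (one line every 10 seconds, 6 samples)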

I did some filebench and tar-extract-over-NFS tests of a J4400 (500GB,
7200RPM SATA drives), with and without a slog, where the slog was using the
internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind
the standard Sun/Adaptec internal RAID controller with 256MB battery-backed
cache memory, all on Solaris 10 U7.

We saw slight differences on filebench oltp profile, and a huge speedup
for the tar extract over NFS tests with the slog present.  Granted, the
latter was with only one NFS client, so likely did not fill NVRAM.  Pretty
good results for a poor-person's slog, though:
http://acc.ohsu.edu/~hakansom/j4400_bench.html

Just as an aside, and based on my experience as a user/admin of various
NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem
to get very good improvements with relatively small amounts of NVRAM
(128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had
tens of GB of NVRAM.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot hold 'xxx': pool must be upgraded

2009-09-25 Thread Robert Milkowski

Chris Kirby wrote:

On Sep 25, 2009, at 2:43 PM, Robert Milkowski wrote:


Chris Kirby wrote:

On Sep 25, 2009, at 11:54 AM, Robert Milkowski wrote:

 That's useful information indeed.  I've filed this CR:

6885860 zfs send shouldn't require support for snapshot holds

Sorry for the trouble, please look for this to be fixed soon.

Thank you.
btw: how do you want to fix it? Do you want to acquire a snapshot 
hold but continue anyway if it is not possible (only in the case where the 
error is ENOTSUP, I think)? Or do you want to get rid of it entirely?



In this particular case, we should make sure the pool version supports
snapshot holds before trying to request (or release) any.

We still want to acquire the temporary holds if we can, since that
prevents a race with zfs destroy.  That case is becoming more common
with automated snapshots and their associated retention policies.



Yeah, this makes sense.
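
For anyone following along, the hold mechanism itself looks roughly like
this from the CLI on a new-enough pool (the snapshot name is just an example):

  # zfs hold keep tank/fs@snap      (place a user hold tagged 'keep')
  # zfs holds tank/fs@snap          (list holds on the snapshot)
  # zfs destroy tank/fs@snap        (now refuses: dataset is busy)
  # zfs release keep tank/fs@snap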


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson hakan...@ohsu.edu wrote:
 j...@jamver.id.au said:
 For a predominantly NFS server purpose, it really looks like a case of the
 slog has to outperform your main pool for continuous write speed as well as
 an instant response time as the primary criterion. Which might as well be a
 fast (or group of fast) SSDs or 15kRPM drives with some NVRAM in front of
 them.

 I wonder if you ran Richard Elling's zilstat while running your
 workload.  That should tell you how much ZIL bandwidth is needed,
 and it would be interesting to see if its stats match with your
 other measurements of slog-device traffic.

Yes, but if it's on NFS you can just figure out the workload in MB/s
and use that as a rough guideline.

Problem is most SSD manufacturers list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.

 I did some filebench and tar extract over NFS tests of J4400 (500GB,
 7200RPM SATA drives), with and without slog, where slog was using the
 internal 2.5 10kRPM SAS drives in an X4150.  These drives were behind
 the standard Sun/Adaptec internal RAID controller, 256MB battery-backed
 cache memory, all on Solaris-10U7.

 We saw slight differences on filebench oltp profile, and a huge speedup
 for the tar extract over NFS tests with the slog present.  Granted, the
 latter was with only one NFS client, so likely did not fill NVRAM.  Pretty
 good results for a poor-person's slog, though:
        http://acc.ohsu.edu/~hakansom/j4400_bench.html

I did a similar test with a 512MB BBU controller and saw no difference
with or without the SSD slog, so I didn't end up using it.

Does your BBU controller ignore the ZFS flushes?

 Just as an aside, and based on my experience as a user/admin of various
 NFS-server vendors, the old Prestoserve cards, and NetApp filers, seem
 to get very good improvements with relatively small amounts of NVRAM
 (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had
 tens of GB of NVRAM.

They don't hold on to the cache for a long time, just as long as it
takes to write it all to disk.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Glenn Lagasse
* David Magda (dma...@ee.ryerson.ca) wrote:
 On Sep 25, 2009, at 16:39, Glenn Lagasse wrote:
 
 There's very little you can safely move in my experience.  /export
 certainly.  Anything else, not really (though ymmv).  I tried to
 create
 a seperate zfs dataset for /usr/local.  That worked some of the time,
 but it also screwed up my system a time or two during
 image-updates/package installs.
 
 I'd be very surprised (disappointed?) if /usr/local couldn't be
 detached from the rpool. Given that in many cases it's an NFS mount,
 I'm curious to know why it would need to be part of the rpool. If it
 is a 'dependency' I would consider that a bug.

It can be detached; however, one issue I ran into was that packages which
installed into /usr/local caused problems when those packages were
upgraded.  Essentially what occurred was that /usr/local was created on
the root pool, and upon reboot the filesystem service went into
maintenance because it couldn't mount the zfs /usr/local dataset on top
of the filled /usr/local root pool location.  I didn't have time to
investigate it fully.  At that point, spinning /usr/local off into
its own zfs dataset just didn't seem worth the hassle.

Others' mileage may vary.
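
In case anyone hits the same thing, the usual way out (untested here;
the dataset and service names are just the obvious candidates) is to move
the stray contents aside, remount, and clear the service:

  # mv /usr/local /usr/local.stray
  # mkdir /usr/local
  # zfs mount rpool/usrlocal        (or simply: zfs mount -a)
  # svcadm clear svc:/system/filesystem/local:default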

-- 
Glenn
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Richard Elling wrote:

By default, the txg commit will occur when 1/8 of memory is used
for writes. For 30 GBytes, that would mean a main memory of only
240 Gbytes... feasible for modern servers.


Ahem.  We were advised that 7/8s of memory is currently what is 
allowed for writes.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufactures list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems 
wrong to me.  Previously we were advised that the slog is basically a 
log of uncommitted system calls so the size of the data chunks written 
to the slog should be similar to the data sizes in the system calls.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Marion Hakanson
rswwal...@gmail.com said:
 Yes, but if it's on NFS you can just figure out the workload in MB/s and use
 that as a rough guideline. 

I wonder if that's the case.  We have an NFS server without an NVRAM cache
(X4500), and it gets huge MB/sec throughput on large-file writes over NFS.
But it's painfully slow on the "tar extract of lots of small files" test,
where many tiny, synchronous metadata operations are performed.


 I did a smiliar test with a 512MB BBU controller and saw no difference with
 or without the SSD slog, so I didn't end up using it.
 
 Does your BBU controller ignore the ZFS flushes? 

I believe it does (it would be slow otherwise).  It's the Sun StorageTek
internal SAS RAID HBA.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Frank Middleton

On 09/25/09 04:44 PM, Lori Alt wrote:


rpool
rpool/ROOT
rpool/ROOT/snv_124 (or whatever version you're running)
rpool/ROOT/snv_124/var (you might not have this)
rpool/ROOT/snv_121 (or whatever other BEs you still have)
rpool/dump
rpool/export
rpool/export/home
rpool/swap


Unless your machine is so starved for physical memory that
you couldn't possibly install anything, AFAIK you can always
boot without dump and swap, so even if your data pool can't
be mounted, you should be OK. I've done many a reboot and
pkg image-update with dump and swap inaccessible. Of course
with no dump, you won't get, well, a dump after a panic...

Having /usr/local (IIRC this doesn't even exist in a straight
OpenSolaris install) in a shared space on your data pool is
quite useful if you have more than one machine unless you have
multiple architectures. Then it turns into the /opt problem.

Hiving off /opt does not seem to prevent booting, and having
it on a data pool doesn't seem to prevent upgrade installs.
The big problem with putting /opt on a shared pool is when
multiple hosts have different /opts. Using legacy mounts seems
to be the only way around this. Do the gurus have a technical
explanation why putting /opt in a different pool shouldn't work?

/var/tmp is a strange beast. It can get quite large, and be a
serious bottleneck if mapped to a physical disk and used by any
program that synchronously creates and deletes large numbers of
files. I have had no problems mapping /var/tmp to /tmp. Hopefully
a guru will step in here and explain why this is a bad idea, but
so far no problems...
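
If anyone wants to try the same thing, one way to do it (this is a guess
at the mechanism, and note the persistence caveat about /var/tmp raised
later in this thread) is a tmpfs entry in /etc/vfstab:

  swap    -    /var/tmp    tmpfs    -    yes    -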

A 32GB SSD is marginal for a root pool, so shrinking it as much
as possible makes a lot of sense until bigger SSDs become cost
effective (not long from now I imagine). But if you already have
a 16GB or 32GB SSD, or a dedicated boot disk <= 32GB, then you
can be SOL unless you are very careful to empty /var/pkg/download,
which doesn't seem to get emptied even if you set the magic flag.
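
The magic flag in question is presumably the IPS image property (name from
memory, so verify against your build before relying on it):

  # pkg set-property flush-content-cache-on-success True
  # rm -rf /var/pkg/download/*      (reclaim what's already cached)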

HTH -- Frank


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Erik Trimble
From a product standpoint, expanding the variety available in the 
Storage 7000 (Amber Road) line is somewhere I think we'd (Sun) make bank on.


Things like:

[ for the home/very small business market ]
Mini-Tower sized case, 4-6 3.5" HS SATA-only bays (to take the 
X2200-style spud-bracket drives), 2 CF slots (for boot), single-socket, 
with 4 DIMMs, and a built-in ILOM.  /maybe/ an x4 PCI-E slot, but maybe not.


[ for the small business/branch office with no racks]
Mid-tower case, 4-bay 2.5" HS area, 6-8 bay 3.5" HS area, single socket, 
4/6 DIMMs, ILOM.  (2) x4 or x8 PCI-E slots too.



(I'd probably go with Socket AM3, with ECC, of course)


I'd sell them both fully loaded with the Amber Road software (and a 
mandatory Service Contract), and as no-OS, no-Service-Contract 
appliance versions.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] selecting zfs BE from OBP

2009-09-25 Thread Donour Sizemore

Ah yes.

Thanks Cindy!

donour


On Sep 25, 2009, at 10:37 AM, Cindy Swearingen wrote:


Hi Donour,

You would use the boot -L syntax to select the ZFS BE to boot from,
like this:

ok boot -L

Rebooting with command: boot -L
Boot device: /p...@8,60/SUNW,q...@4/f...@0,0/ 
d...@w2104cf7fa6c7,0:a  File and args: -L

1 zfs1009BE
2 zfs10092BE
Select environment to boot: [ 1 - 2 ]: 2

Then copy and paste the boot string that is provided:

To boot the selected entry, invoke:
boot [root-device] -Z rpool/ROOT/zfs10092BE

Program terminated
{0} ok boot -Z rpool/ROOT/zfs10092BE

See this pointer as well:

http://docs.sun.com/app/docs/doc/819-5461/ggpco?a=view

Cindy


On 09/25/09 11:09, Donour Sizemore wrote:
Can you select the LU boot environment from sparc obp, if the  
filesystem is zfs? With ufs, you simply invoke 'boot [slice]'.

thanks
donour


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread David Abrahams

on Fri Sep 25 2009, Glenn Lagasse Glenn.Lagasse-AT-Sun.COM wrote:

 The question you're asking can't easily be answered.  Sun doesn't test
 configs like that.  If you really want to do this, you'll pretty much
 have to 'try it and see what breaks'.  And you get to keep both pieces
 if anything breaks.

Heh, that doesn't sound like much fun.  I have a VM I can experiment
with, but I don't want to do this badly enough to take that risk.

 There's very little you can safely move in my experience.  /export
 certainly.  Anything else, not really (though ymmv).  I tried to create
 a seperate zfs dataset for /usr/local.  That worked some of the time,
 but it also screwed up my system a time or two during
 image-updates/package installs.

That's hard to imagine.  My OpenSolaris installation didn't come with a
/usr/local directory.  How can mounting a filesystem from a non-root
pool under /usr possibly mess anything up?

 On my 2010.02/123 system I see:

 bin Symlink to /usr/bin
 boot/
 dev/
 devices/
 etc/
 export/ Safe to move, not tied to the 'root' system

Good to know.

 kernel/
 lib/
 media/
 mnt/
 net/
 opt/
 platform/
 proc/
 rmdisk/
 root/   Could probably move root's homedir

I don't think I'd risk it.

 rpool/
 sbin/
 system/
 tmp/
 usr/
 var/

 Other than /export, everything else is considered 'part of the root
 system'.  Thus part of the root pool.

 Really, if you can't add a mirror for your root pool, then make backups
 of your root pool (left as an exercise to the reader) and store the
 non-system specific bits (/export) on you're raidz2 pool.

Yeah, that's my fallback.  Actually, that along with copies=2 on my root
pool, which I might well do anyhow.  But you people are making a pretty
strong case for making the effort to figure out how to do the mirror
thing.
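
For the record, both fallbacks are essentially one-liners (the device names
below are placeholders for whatever your box reports, and installgrub is the
x86 flavour; SPARC uses installboot instead):

  # zfs set copies=2 rpool          (only protects data written from now on)
  # zpool attach rpool c7d0s0 c8d0s0
  # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c8d0s0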

Thanks, all, for the feedback.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread Bill Sommerfeld

On Fri, 2009-09-25 at 14:39 -0600, Lori Alt wrote:
 The list of datasets in a root pool should look something like this:
...
 rpool/swap  

I've had success with putting swap into other pools.  I believe others
have, as well.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Ross Walker


On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:



On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufactures list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems  
wrong to me.  Previously we were advised that the slog is basically  
a log of uncommitted system calls so the size of the data chunks  
written to the slog should be similar to the data sizes in the  
system calls.


Are these not broken into recordsize chunks?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS pool replace single disk with raidz

2009-09-25 Thread Ryan Hirsch
I have a zpool named rtank.  I accidentally attached a single drive to the pool.  
I am an idiot, I know :D  Now I want to replace this single drive with a raidz 
group.  Below is the pool setup and what I tried:
 

NAMESTATE READ WRITE CKSUM
rtank   ONLINE   0 0 0
 - raidz1ONLINE   0 0 0
   -- c4t0d0  ONLINE   0 0 0
   -- c4t1d0  ONLINE   0 0 0
   -- c4t2d0  ONLINE   0 0 0
   -- c4t3d0  ONLINE   0 0 0
   -- c4t4d0  ONLINE   0 0 0
   -- c4t5d0  ONLINE   0 0 0
   -- c4t6d0  ONLINE   0 0 0
   -- c4t7d0  ONLINE   0 0 0
 - raidz1ONLINE   0 0 0
   -- c3t0d0  ONLINE   0 0 0
   -- c3t1d0  ONLINE   0 0 0
   -- c3t2d0  ONLINE   0 0 0
   -- c3t3d0  ONLINE   0 0 0
   -- c3t4d0  ONLINE   0 0 0
   -- c3t5d0  ONLINE   0 0 0
   - c5d0  ONLINE   0 0 0  <--- single drive in the pool, not in any raidz


$ pfexec zpool replace rtank c5d0 raidz c3t6d0 c3t7d0 c3t8d0 c3t9d0 c3t10d0 
c3t11d0
too many arguments

$ zpool upgrade -v
This system is currently running ZFS pool version 18.


Is what I am trying to do possible?  If so what am I doing wrong?  Thanks.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic slow responsiveness

2009-09-25 Thread Neil Perrin



On 09/25/09 16:19, Bob Friesenhahn wrote:

On Fri, 25 Sep 2009, Ross Walker wrote:


Problem is most SSD manufactures list sustained throughput with large
IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD
that can handle the throughput.


Who said that the slog SSD is written to in 128K chunks?  That seems 
wrong to me.  Previously we were advised that the slog is basically a 
log of uncommitted system calls so the size of the data chunks written 
to the slog should be similar to the data sizes in the system calls.


Log blocks are variable in size dependent on what needs to be committed.
The minimum size is 4KB and the max 128KB. Log records are aggregated
and written together as much as possible.

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which directories must be part of rpool?

2009-09-25 Thread David Magda

On Sep 25, 2009, at 19:39, Frank Middleton wrote:


/var/tmp is a strange beast. It can get quite large, and be a
serious bottleneck if mapped to a physical disk and used by any
program that synchronously creates and deletes large numbers of
files. I have had no problems mapping /var/tmp to /tmp. Hopefully
a guru will step in here and explain why this is a bad idea, but
so far no problems...


The contents of /var/tmp can be expected to survive between boots  
(e.g., /var/tmp/vi.recover); /tmp is nuked on power cycles (because  
it's just memory/swap):


/tmp: A directory made available for applications that need a place  
to create temporary files. Applications shall be allowed to create  
files in this directory, but shall not assume that such files are  
preserved between invocations of the application.


http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap10.html

If a program is creating and deleting large numbers of files, and  
those files aren't needed between reboots, then it really should be  
using /tmp.


Similar definition for Linux FWIW:

http://www.pathname.com/fhs/pub/fhs-2.3.html

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool replace single disk with raidz

2009-09-25 Thread Bob Friesenhahn

On Fri, 25 Sep 2009, Ryan Hirsch wrote:

I have a zpool named rtank.  I accidently attached a single drive to 
the pool.  I am an idiot I know :D Now I want to replace this single 
drive with a raidz group.  Below is the pool setup and what I tried:


I think that the best you will be able to do is to turn this single 
drive into a mirror.  It seems that this sort of human error occurs 
pretty often and there is not yet a way to properly fix it.
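
i.e. something along the lines of (the second device name is just an
example of a spare disk you have available):

  # zpool attach rtank c5d0 c3t6d0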


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss