Re: [zfs-discuss] partioned cache devices

2013-03-19 Thread Andrew Gabriel

Andrew Werchowiecki wrote:


 Total disk size is 9345 cylinders
 Cylinder size is 12544 (512 byte) blocks
 
   Cylinders

  Partition   Status    Type          Start   End    Length    %
  =========   ======    ============  =====   ====   ======   ===
      1                 EFI               0   9345     9346   100


You only have a p1 (and for a GPT/EFI labeled disk, you can only
have p1 - no other FDISK partitions are allowed).


partition print
Current partition table (original):
Total disk sectors available: 117214957 + 16384 (reserved sectors)
 
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                64      2.00GB            4194367
  1        usr    wm           4194368     53.89GB          117214990
  2 unassigned    wm                 0           0                  0
  3 unassigned    wm                 0           0                  0
  4 unassigned    wm                 0           0                  0
  5 unassigned    wm                 0           0                  0
  6 unassigned    wm                 0           0                  0
  8   reserved    wm         117214991      8.00MB          117231374


You have an s0 and s1.

This isn’t the output from when I did it but it is exactly the same 
steps that I followed.
 
Thanks for the info about slices, I may give that a go later on. I’m not 
keen on that because I have clear evidence (as in zpools set up this 
way, right now, working, without issue) that GPT partitions of the style 
shown above work and I want to see why it doesn’t work in my set up 
rather than simply ignoring and moving on.


You would have to blow away the partitioning you have, and create an FDISK
partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't
use the 'partition' subcommand, which confusingly creates solaris slices.)
Give the FDISK partitions a partition type which nothing will recognise,
such as 'other', so that nothing will try and interpret them as OS partitions.
Then you can use them as raw devices, and they should be portable between
OS's which can handle FDISK partitioned devices.
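
For example, a rough sketch of that workflow (the fdisk menu prompts vary
between releases, so the choices are illustrative; the pool/device names are
just the ones from this thread):

format -e /dev/rdsk/c25t10d1p0
  fdisk          # delete the existing EFI partition, then create two FDISK
                 # partitions, picking a type such as "Other" so no OS
                 # tries to interpret them
  quit
# the partitions then show up as raw devices p1 and p2:
zpool add aggr0 log c25t10d1p1
zpool add aggr0 cache c25t10d1p2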

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] partioned cache devices

2013-03-19 Thread Andrew Gabriel

On 03/19/13 20:27, Jim Klimov wrote:

I disagree; at least, I've always thought differently:
the d device is the whole disk denomination, with a
unique number for a particular controller link (c+t).

The disk has some partitioning table, MBR or GPT/EFI.
In these tables, partition p0 stands for the table
itself (i.e. to manage partitioning),


p0 is the whole disk regardless of any partitioning.
(Hence you can use p0 to access any type of partition table.)


and the rest kind
of depends. In case of MBR tables, one partition may
be named as having a Solaris (or Solaris2) type, and
there it holds a SMI table of Solaris slices, and these
slices can hold legacy filesystems or components of ZFS
pools. In case of GPT, the GPT-partitions can be used
directly by ZFS. However, they are also denominated as
slices in ZFS and format utility.


The GPT partitioning spec requires the disk to be FDISK
partitioned with just one single FDISK partition of type EFI,
so that tools which predate GPT partitioning will still see
such a GPT disk as fully assigned to FDISK partitions, and
therefore less likely to be accidentally blown away.


I believe, Solaris-based OSes accessing a p-named
partition and an s-named slice of the same number
on a GPT disk should lead to the same range of bytes
on disk, but I am not really certain about this.


No, you'll see just p0 (whole disk), and p1 (whole disk
less space for the backwards compatible FDISK partitioning).


Also, if a whole disk is given to ZFS (and for OSes
other that the latest Solaris 11 this means non-rpool
disks), then ZFS labels the disk as GPT and defines a
partition for itself plus a small trailing partition
(likely to level out discrepancies with replacement
disks that might happen to be a few sectors too small).
In this case ZFS reports that it uses cXtYdZ as a
pool component,


For an EFI disk, the device name without a final p* or s*
component is the whole EFI partition. (It's actually the
s7 slice minor device node, but the s7 is dropped from
the device name to avoid the confusion we had with s2
on SMI labeled disks being the whole SMI partition.)


since it considers itself in charge
of the partitioning table and its inner contents, and
doesn't intend to share the disk with other usages
(dual-booting and other OSes' partitions, or SLOG and
L2ARC parts, etc). This also allows ZFS to influence
hardware-related choices, like caching and throttling,
and likely auto-expansion with the changed LUN sizes
by fixing up the partition table along the way, since
it assumes being 100% in charge of the disk.

I don't think there is a crime in trying to use the
partitions (of either kind) as ZFS leaf vdevs, even the
zpool(1M) manpage states that:

... The  following  virtual  devices  are supported:
  disk
A block device, typically located under  /dev/dsk.
ZFS  can  use  individual  slices  or  partitions,
though the recommended mode of operation is to use
whole  disks.  ...


Right.


This is orthogonal to the fact that there can only be
one Solaris slice table, inside one partition, on MBR.
AFAIK this is irrelevant on GPT/EFI - no SMI slices there.


There's a simpler way to think of it on x86.
You always have FDISK partitioning (p1, p2, p3, p4).
You can then have SMI or GPT/EFI slices (both called s0, s1, ...)
in an FDISK partition of the appropriate type.
With SMI labeling, s2 is by convention the whole Solaris FDISK
partition (although this is not enforced).
With EFI labeling, s7 is enforced as the whole EFI FDISK partition,
and so the trailing s7 is dropped off the device name for
clarity.

This simplicity is brought about because the GPT spec requires
that backwards compatible FDISK partitioning is included, but
with just 1 partition assigned.
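
As a concrete illustration, assuming a disk at c5t0d0 (which nodes actually
exist depends on the labels present):

ls /dev/dsk/c5t0d0*    # c5t0d0p0-p4 = FDISK partitions (p0 is the whole disk)
                       # c5t0d0s0-s15 = slices within the Solaris or EFI partition
ls /dev/rdsk/c5t0d0*   # the corresponding raw (character) device nodes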

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] partioned cache devices

2013-03-18 Thread Andrew Werchowiecki
I did something like the following:

format -e /dev/rdsk/c5t0d0p0
fdisk
1 (create)
F (EFI)
6 (exit)
partition
label
1
y
0
usr
wm
64
4194367e
1
usr
wm
4194368
117214990
label
1
y



 Total disk size is 9345 cylinders
 Cylinder size is 12544 (512 byte) blocks

   Cylinders
  Partition   Status    Type          Start   End    Length    %
  =========   ======    ============  =====   ====   ======   ===
      1                 EFI               0   9345     9346   100

partition print
Current partition table (original):
Total disk sectors available: 117214957 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                64      2.00GB            4194367
  1        usr    wm           4194368     53.89GB          117214990
  2 unassigned    wm                 0           0                  0
  3 unassigned    wm                 0           0                  0
  4 unassigned    wm                 0           0                  0
  5 unassigned    wm                 0           0                  0
  6 unassigned    wm                 0           0                  0
  8   reserved    wm         117214991      8.00MB          117231374

This isn't the output from when I did it but it is exactly the same steps that 
I followed.

Thanks for the info about slices, I may give that a go later on. I'm not keen 
on that because I have clear evidence (as in zpools set up this way, right now, 
working, without issue) that GPT partitions of the style shown above work and I 
want to see why it doesn't work in my set up rather than simply ignoring and 
moving on.

From: Fajar A. Nugraha [mailto:w...@fajar.net]
Sent: Sunday, 17 March 2013 3:04 PM
To: Andrew Werchowiecki
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] partioned cache devices

On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki 
andrew.werchowie...@xpanse.com.au 
wrote:
I understand that p0 refers to the whole disk... in the logs I pasted in I'm 
not attempting to mount p0. I'm trying to work out why I'm getting an error 
attempting to mount p2, after p1 has successfully mounted. Further, this has 
been done before on other systems in the same hardware configuration in the 
exact same fashion, and I've gone over the steps trying to make sure I haven't 
missed something but can't see a fault.

How did you create the partition? Are those marked as a solaris partition, or 
something else (e.g. fdisk on linux uses type 83 by default)?

I'm not keen on using Solaris slices because I don't have an understanding of 
what that does to the pool's OS interoperability.


Linux can read solaris slices and import solaris-made pools just fine, as long 
as you're using a compatible zpool version (e.g. zpool version 28).

--
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] partioned cache devices

2013-03-16 Thread Andrew Werchowiecki
It's a home set up, the performance penalty from splitting the cache devices is 
non-existent, and that workaround sounds like a pretty crazy amount of 
overhead when I could instead just have a mirrored slog.

I'm less concerned about wasted space, more concerned about the number of SAS ports 
I have available.

I understand that p0 refers to the whole disk... in the logs I pasted in I'm 
not attempting to mount p0. I'm trying to work out why I'm getting an error 
attempting to mount p2, after p1 has successfully mounted. Further, this has 
been done before on other systems in the same hardware configuration in the 
exact same fashion, and I've gone over the steps trying to make sure I haven't 
missed something but can't see a fault. 

I'm not keen on using Solaris slices because I don't have an understanding of 
what that does to the pool's OS interoperability. 

From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) 
[opensolarisisdeadlongliveopensola...@nedharvey.com]
Sent: Friday, 15 March 2013 8:44 PM
To: Andrew Werchowiecki; zfs-discuss@opensolaris.org
Subject: RE: partioned cache devices

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andrew Werchowiecki

 muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
 Password:
 cannot open '/dev/dsk/c25t10d1p2': I/O error
 muslimwookie@Pyzee:~$

 I have two SSDs in the system, I've created an 8gb partition on each drive for
 use as a mirrored write cache. I also have the remainder of the drive
 partitioned for use as the read only cache. However, when attempting to add
 it I get the error above.

Sounds like you're probably running into confusion about how to partition the 
drive.  If you create fdisk partitions, they will be accessible as p0, p1, p2, 
but I think p0 unconditionally refers to the whole drive, so the first 
partition is p1, and the second is p2.

If you create one big solaris fdisk partition and then slice it via 'partition', 
where s2 is typically the encompassing slice, and people usually use s1 and s2 
and s6 for actual slices, then they will be accessible via s1, s2, s6

Generally speaking, it's unadvisable to split the slog/cache devices anyway.  
Because:

If you're splitting it, evidently you're focusing on the wasted space.  Buying 
an expensive 128G device where you couldn't possibly ever use more than 4G or 
8G in the slog.  But that's not what you should be focusing on.  You should be 
focusing on the speed (that's why you bought it in the first place.)  The slog 
is write-only, and the cache is a mixture of read/write, where it should be 
hopefully doing more reads than writes.  But regardless of your actual success 
with the cache device, your cache device will be busy most of the time, and 
competing against the slog.

You have a mirror, you say.  You should probably drop both the cache & log.  
Use one whole device for the cache, use one whole device for the log.  The only 
risk you'll run is:

Since a slog is write-only (except during mount, typically at boot) it's 
possible to have a failure mode where you think you're writing to the log, but 
the first time you go back and read, you discover an error, and discover the 
device has gone bad.  In other words, without ever doing any reads, you might 
not notice when/if the device goes bad.  Fortunately, there's an easy 
workaround.  You could periodically (say, once a month) script the removal of 
your log device, create a junk pool, write a bunch of data to it, scrub it 
(thus verifying it was written correctly) and in the absence of any scrub 
errors, destroy the junk pool and re-add the device as a slog to the main pool.
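
A minimal sketch of that monthly check (pool and device names are hypothetical):

zpool remove tank c25t9d1        # take the slog out of the main pool
zpool create junk c25t9d1        # build a throwaway pool on it
dd if=/dev/urandom of=/junk/testfile bs=1M count=1024
zpool scrub junk                 # verify what was written reads back cleanly
zpool status junk                # check for scrub errors before trusting it
zpool destroy junk
zpool add tank log c25t9d1       # return it to duty as the slog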

I've never heard of anyone actually being that paranoid, and I've never heard 
of anyone actually experiencing the aforementioned possible undetected device 
failure mode.  So this is all mostly theoretical.

Mirroring the slog device really isn't necessary in the modern age.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] partioned cache devices

2013-03-14 Thread Andrew Werchowiecki
Hi all,

I'm having some trouble with adding cache drives to a zpool, anyone got any 
ideas?

muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
Password:
cannot open '/dev/dsk/c25t10d1p2': I/O error
muslimwookie@Pyzee:~$

I have two SSDs in the system, I've created an 8gb partition on each drive for 
use as a mirrored write cache. I also have the remainder of the drive 
partitioned for use as the read only cache. However, when attempting to add it 
I get the error above.

Here's a zpool status:

  pool: aggr0
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 21 21:13:45 2013
1.13T scanned out of 20.0T at 106M/s, 51h52m to go
74.2G resilvered, 5.65% done
config:

NAME                         STATE     READ WRITE CKSUM
aggr0                        DEGRADED     0     0     0
  raidz2-0                   DEGRADED     0     0     0
    c7t5000C50035CA68EDd0    ONLINE       0     0     0
    c7t5000C5003679D3E2d0    ONLINE       0     0     0
    c7t50014EE2B16BC08Bd0    ONLINE       0     0     0
    c7t50014EE2B174216Dd0    ONLINE       0     0     0
    c7t50014EE2B174366Bd0    ONLINE       0     0     0
    c7t50014EE25C1E7646d0    ONLINE       0     0     0
    c7t50014EE25C17A62Cd0    ONLINE       0     0     0
    c7t50014EE25C17720Ed0    ONLINE       0     0     0
    c7t50014EE206C2AFD1d0    ONLINE       0     0     0
    c7t50014EE206C8E09Fd0    ONLINE       0     0     0
    c7t50014EE602DFAACAd0    ONLINE       0     0     0
    c7t50014EE602DFE701d0    ONLINE       0     0     0
    c7t50014EE20677C1C1d0    ONLINE       0     0     0
    replacing-13             UNAVAIL      0     0     0
      c7t50014EE6031198C1d0  UNAVAIL      0     0     0  cannot open
      c7t50014EE0AE2AB006d0  ONLINE       0     0     0  (resilvering)
    c7t50014EE65835480Dd0    ONLINE       0     0     0
logs
  mirror-1                   ONLINE       0     0     0
    c25t10d1p1               ONLINE       0     0     0
    c25t9d1p1                ONLINE       0     0     0

errors: No known data errors

As you can see, I've successfully added the 8gb partitions as a mirrored write cache. 
Interestingly, when I do a zpool iostat -v it shows the total as 111gb:

                              capacity     operations    bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
aggr0                       20.0T  7.27T  1.33K    139  81.7M  4.19M
  raidz2                    20.0T  7.27T  1.33K    115  81.7M  2.70M
    c7t5000C50035CA68EDd0       -      -    566      9  6.91M   241K
    c7t5000C5003679D3E2d0       -      -    493      8  6.97M   242K
    c7t50014EE2B16BC08Bd0       -      -    544      9  7.02M   239K
    c7t50014EE2B174216Dd0       -      -    525      9  6.94M   241K
    c7t50014EE2B174366Bd0       -      -    540      9  6.95M   241K
    c7t50014EE25C1E7646d0       -      -    549      9  7.02M   239K
    c7t50014EE25C17A62Cd0       -      -    534      9  6.93M   241K
    c7t50014EE25C17720Ed0       -      -    542      9  6.95M   241K
    c7t50014EE206C2AFD1d0       -      -    549      9  7.02M   239K
    c7t50014EE206C8E09Fd0       -      -    526     10  6.94M   241K
    c7t50014EE602DFAACAd0       -      -    576     10  6.91M   241K
    c7t50014EE602DFE701d0       -      -    591     10  7.00M   239K
    c7t50014EE20677C1C1d0       -      -    530     10  6.95M   241K
    replacing                   -      -      0    922      0  7.11M
      c7t50014EE6031198C1d0     -      -      0      0      0      0
      c7t50014EE0AE2AB006d0     -      -      0    622      2  7.10M
    c7t50014EE65835480Dd0       -      -    595     10  6.98M   239K
logs                            -      -      -      -      -      -
  mirror                     740K   111G      0     43      0  2.75M
    c25t10d1p1                  -      -      0     43      3  2.75M
    c25t9d1p1                   -      -      0     43      3  2.75M
--------------------------  -----  -----  -----  -----  -----  -----
rpool                       7.32G  12.6G      2      4  41.9K  43.2K
  c4t0d0s0                  7.32G  12.6G      2      4  41.9K  43.2K
--------------------------  -----  -----  -----  -----  -----  -----

Something funky is going on here...

Wooks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Andrew Gabriel

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: Darren J Moffat [mailto:darr...@opensolaris.org]

Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.



When I search for scsi unmap, I come up with all sorts of documentation that 
... is ... like reading a medical journal when all you want to know is the 
conversion from 98.6F to C.

Would you mind momentarily, describing what SCSI UNMAP is used for?  If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...  


Customer doesn't *necessarily* have to be as clueless as CEO/CFO.  Perhaps just 
another IT person, or whatever.
  


SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that 
some blocks are no longer needed. (This might be because a file has been 
deleted in the filesystem on the device.)


In the case of a Flash device, it can optimise usage by knowing this, 
e.g. it can perhaps perform a background erase on the real blocks so 
they're ready for reuse sooner, and/or better optimise wear leveling by 
having more spare space to play with; on some devices that extra spare 
space also improves the device's lifetime. It can also help by avoiding 
some read-modify-write operations, if the device knows the data that is 
in the rest of the 4k block is no longer needed.


In the case of an iSCSI LUN target, these blocks no longer need to be 
archived, and if sparse space allocation is in use, the space they 
occupied can be freed off. In the particular case of ZFS provisioning 
the iSCSI LUN (COMSTAR), you might get performance improvements by 
having more free space to play with during other write operations to 
allow better storage layout optimisation.


So, bottom line is longer life of SSDs (maybe higher performance too if 
there's less waiting for erases during writes), and better space 
utilisation and performance for a ZFS COMSTAR target.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array on marvell88sx in Solaris 11.1

2012-12-13 Thread Andrew Gabriel
3112 and 3114 were very early SATA controllers, from before there were any 
SATA drivers; they pretend to be ATA controllers to the OS.

No one should be using these today.

sol wrote:
Oh I can run the disks off a SiliconImage 3114 but it's the marvell 
controller that I'm trying to get working. I'm sure it's the 
controller which is used in the Thumpers so it should surely work in 
solaris 11.1



From: Bob Friesenhahn bfrie...@simple.dallas.tx.us

If the SATA card you are using is a JBOD-style card (i.e. disks
are portable to a different controller), are you able/willing to
swap it for one that Solaris is known to support well?




--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove disk

2012-12-02 Thread Andrew Gabriel

Bob Friesenhahn wrote:

On Sat, 1 Dec 2012, Jan Owoc wrote:



When I would like to change the disk, I also would like change the disk
enclosure, I don't want to use the old one.


You didn't give much detail about the enclosure (how it's connected,
how many disk bays it has, how it's used etc.), but are you able to
power off the system and transfer all the disks at once?


And what happens if I have 24 or 36 disks to change? It would take a month to do that.


Those are the current limitations of zfs. Yes, with 12x2TB of data to
copy it could take about a month.



You can create a brand new pool with the new chassis and use 'zfs 
send' to send a full snapshot of each filesystem to the new pool. 
After the bulk of the data has been transferred, take new snapshots 
and send the remainder. This expects that both pools can be available 
at once. 


or if you don't care about existing snapshots, use Shadow Migration to 
move the data across.
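
A rough outline of that two-pass migration (pool names are hypothetical):

zfs snapshot -r oldpool@migrate1
zfs send -R oldpool@migrate1 | zfs receive -Fd newpool    # bulk copy
# ...later, after quiescing writers:
zfs snapshot -r oldpool@migrate2
zfs send -R -i @migrate1 oldpool@migrate2 | zfs receive -Fd newpool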


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)

2012-10-18 Thread Andrew Gabriel

Arne Jansen wrote:

We have finished a beta version of the feature.


What does FITS stand for?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making ZIL faster

2012-10-04 Thread Andrew Gabriel

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Schweiss, Chip

How can I determine for sure that my ZIL is my bottleneck?  If it is the
bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
make it faster?  Or should I be looking for a DDR drive, ZeusRAM, etc.


Temporarily set sync=disabled
Or, depending on your application, leave it that way permanently.  I know, for 
the work I do, most systems I support at most locations have sync=disabled.  It 
all depends on the workload.


Noting of course that this means that in the case of an unexpected system 
outage or loss of connectivity to the disks, synchronous writes since the last 
txg commit will be lost, even though the applications will believe they are 
secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's 
been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with 
fresh data and those which can look closely at the data to see how far they got 
before being rudely interrupted, but not for those which rely on the Posix 
semantics of synchronous writes/syncs meaning data is secured on non-volatile 
storage when the function returns.
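
For reference, the setting is per dataset and trivially reversible (dataset
name is hypothetical):

zfs set sync=disabled tank/data    # acknowledge synchronous writes immediately
zfs get sync tank/data             # check the current value
zfs set sync=standard tank/data    # restore POSIX-compliant behaviour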

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] all in one server

2012-09-18 Thread Andrew Gabriel

Richard Elling wrote:
On Sep 18, 2012, at 7:31 AM, Eugen Leitl eu...@leitl.org wrote:


Can I actually have a year's worth of snapshots in
zfs without too much performance degradation?


I've got 6 years of snapshots with no degradation :-)


$ zfs list -t snapshot -r export/home | wc -l
   1951
$ echo 1951 / 365 | bc -l
5.34520547945205479452
$

So you're slightly ahead of my 5.3 years of daily snapshots:-)

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Andrew Gabriel

On 05/28/12 20:06, Iwan Aucamp wrote:
I'm getting sub-optimal performance with an mmap based database 
(mongodb) which is running on zfs on Solaris 10u9.


System is Sun-Fire X4270-M2 with 2xX5680 and 72GB (6 * 8GB + 6 * 4GB) 
ram (installed so it runs at 1333MHz) and 2 * 300GB 15K RPM disks


 - a few mongodb instances are running with with moderate IO and total 
rss of 50 GB
 - a service which logs quite excessively (5GB every 20 mins) is also 
running (max 2GB ram use) - log files are compressed after some time 
to bzip2.


Database performance is quite horrid though - it seems that zfs does 
not know how to manage allocation between page cache and arc cache - 
and it seems arc cache wins most of the time.


I'm thinking of doing the following:
 - relocating mmaped (mongo) data to a zfs filesystem with only 
metadata cache

 - reducing zfs arc cache to 16 GB

Is there any other recommendations - and is above likely to improve 
performance.


1. Upgrade to S10 Update 10 - this has various performance improvements, 
in particular related to database type loads (but I don't know anything 
about mongodb).


2. Reduce the ARC size so RSS + ARC + other memory users < RAM size.
I assume the RSS includes whatever caching the database does. In 
theory, a database should be able to work out what's worth caching 
better than any filesystem can guess from underneath it, so you want to 
configure more memory in the DB's cache than in the ARC. (The default 
ARC tuning is unsuitable for a database server.)


3. If the database has some concept of blocksize or recordsize that it 
uses to perform i/o, make sure the filesystems it is using configured to 
be the same recordsize. The ZFS default recordsize (128kB) is usually 
much bigger than database blocksizes. This is probably going to have 
less impact with an mmaped database than a read(2)/write(2) database, 
where it may prove better to match the filesystem's record size to the 
system's page size (4kB, unless it's using some type of large pages). I 
haven't tried playing with recordsize for memory mapped i/o, so I'm 
speculating here.


Blocksize or recordsize may apply to the log file writer too, and it may 
be that this needs a different recordsize and therefore has to be in a 
different filesystem. If it uses write(2) or some variant rather than 
mmap(2) and doesn't document this in detail, Dtrace is your friend.


4. Keep plenty of free space in the zpool if you want good database 
performance. If you're more than 60% full (S10U9) or 80% full (S10U10), 
that could be a factor.
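
Pulling points 2 and 3 together with the original poster's plan, a sketch of
the dataset setup (pool/dataset names are hypothetical; the mongodb dbpath
and the log writer would then be pointed at them):

zfs create -o recordsize=4k tank/mongo       # match page-sized mmap i/o (speculative, per point 3)
zfs set primarycache=metadata tank/mongo     # the poster's plan: let the DB, not the ARC, cache data
zfs create -o recordsize=128k tank/applogs   # separate dataset if the log writer wants a different recordsize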


Anyway, there are a few things to think about.

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs_arc_max values

2012-05-17 Thread Andrew Gabriel

On 05/17/12 15:03, Bob Friesenhahn wrote:

On Thu, 17 May 2012, Paul Kraus wrote:


   Why are you trying to tune the ARC as _low_ as possible? In my
experience the ARC gives up memory readily for other uses. The only
place I _had_ to tune the ARC in production was a  couple systems
running an app that checks for free memory _before_ trying to allocate
it. If the ARC has all but 1 GB in use, the app (which is looking for


On my system I adjusted the ARC down due to running user-space 
applications with very bursty short-term large memory usage. Reducing 
the ARC assured that there would be no contention between zfs ARC and 
the applications.


If the system is running one app which expects to do lots of application 
level caching (and in theory, the app should be able to work out what's 
worth caching and what isn't better than any filesystem underneath it 
can guess), then you should be planning your memory usage accordingly.


For example, a database server, you probably want to allocate much of 
the system's memory to
the database cache (in the case of Oracle, the SGA), leaving enough for 
a smaller ZFS arc and the memory required by the OS and app. Depends on 
the system and database size, but something like 50% SGA, 25% ZFS ARC, 
25% for everything else might be an example, with the SGA 
disproportionally bigger on larger systems with larger databases.


On my desktop system (supposed to be 8GB RAM, but currently 6GB due to a 
dead DIMM), I have knocked the ARC down to 1GB. I used to find the ARC 
wouldn't shrink in size until the system had got to the point of crawling 
along showing anon page-ins, and some app (usually firefox or 
thunderbird) had already become too difficult to use. I must admit I did 
this a long time ago, and ZFS's shrinking of the ARC may be more 
proactive now than it was back then, but I don't notice any ZFS 
performance issues with the ARC restricted to 1GB on a desktop system. 
It may have increased scrub times, but that happens when I'm in bed, so I 
don't care.
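
For anyone wanting to do the same, a sketch of how the cap is applied and
checked (the value shown is 1GB):

# in /etc/system, then reboot:
#   set zfs:zfs_arc_max = 0x40000000
# observe the current ARC size and ceiling:
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max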


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-26 Thread Andrew Gabriel
I just played and knocked this up (note the stunning lack of comments, 
missing optarg processing, etc)...

Give it a list of files to check...

#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <fcntl.h>

int
main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        int fd;

        fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
        } else {
            off_t eof;
            off_t hole;

            if (((eof = lseek(fd, 0, SEEK_END)) < 0) ||
                lseek(fd, 0, SEEK_SET) < 0) {
                perror(argv[i]);
            } else if (eof == 0) {
                printf("%s: empty\n", argv[i]);
            } else {
                hole = lseek(fd, 0, SEEK_HOLE);
                if (hole < 0) {
                    perror(argv[i]);
                } else if (hole < eof) {
                    printf("%s: sparse\n", argv[i]);
                } else {
                    printf("%s: not sparse\n", argv[i]);
                }
            }
            close(fd);
        }
    }
    return 0;
}
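
Assuming the above is saved as sparse.c, it builds and runs along these lines:

cc -o sparse sparse.c
./sparse /etc/passwd /some/possibly/sparse/file
# prints one line per file: empty, sparse, or not sparse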


On 03/26/12 10:06 PM, ольга крыжановская wrote:

Mike, I was hoping that some one has a complete example for a bool
has_file_one_or_more_holes(const char *path) function.

Olga

2012/3/26 Mike Gerdts mger...@gmail.com:

2012/3/26 ольга крыжановская olga.kryzhanov...@gmail.com:

How can I test if a file on ZFS has holes, i.e. is a sparse file,
using the C api?

See SEEK_HOLE in lseek(2).

--
Mike Gerdts
http://mgerdts.blogspot.com/





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any recommendations on Perc H700 controller on Dell Rx10 ?

2012-03-10 Thread Andrew Gabriel

On 03/10/12 09:29, Sriram Narayanan wrote:

Hi folks:

At work, I have an R510, and R610 and an R710 - all with the H700 PERC
controller.

Based on experiments, it seems like there is no way to bypass the PERC
controller - it seems like one can only access the individual disks if
they are set up in RAID0 each.

This brings me to ask some questions:
a. Is it fine (in terms of an intelligent controller coming in the way
of ZFS) to have the PERC controllers present each drive as RAID0
drives ?
b. Would there be any errors in terms of PERC doing things that ZFS is
not aware of and this causing any issues later  ?


I had to produce a ZFS hybrid storage pool performance demo, and was 
initially given a system with a RAID-only controller (different from 
yours, but same idea).


I created the demo with it, but disabled the RAID's cache as that wasn't 
what I wanted in the picture. Meanwhile, I ordered the non-RAID version 
of the card.


When it came, I swapped it in, and hit a couple of issues...

ZFS doesn't recognise any of the disks obviously, because they have 
proprietary RAID headers on them, so they have to be created again from 
scratch. (That was no big deal in this case, and if it had been, I could 
have done a zfs send and receive to somewhere else temporarily.)


The performance went up, a tiny bit for the spinning disks, and by 50% 
for the SSDs, so the RAID controller was seriously limiting the IOPs of 
the SSDs in particular. This was when SSDs were relatively new, and the 
controllers may not have been designed with SSDs in mind. That's likely 
to be somewhat different nowadays, but I don't have any data to show 
that either way.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send only difference between snapshots

2012-02-14 Thread Andrew Gabriel

skeletor wrote:
There is a task: make backup by sending snapshots to another server. 
But I don't want to send each time a complete snapshot of the system - 
I want to send only the difference between a snapshots.
For example: there are 2 servers, and I want to do the snapshot on the 
master, send only the difference between the current and recent 
snapshots on the backup and then deploy it on backup.

Any ideas how this can be done?


It's called an incremental - it's part of the zfs send command line options.
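
A minimal sketch, with hypothetical dataset, snapshot and host names:

zfs snapshot master/data@tue
zfs send -i master/data@mon master/data@tue | \
    ssh backuphost zfs receive backup/data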

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?

2012-01-15 Thread Andrew Gabriel

Gary Mills wrote:

On Sun, Jan 15, 2012 at 04:06:33PM +, Peter Tribble wrote:
  

On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov jimkli...@cos.ru wrote:


Does raidzN actually protect against bitrot?
That's a kind of radical, possibly offensive, question formula
that I have lately.
  

Yup, it does. That's why many of us use it.



There's actually no such thing as bitrot on a disk.  Each sector on
the disk is accompanied by a CRC that's verified by the disk
controller on each read.  It will either return correct data or report
an unreadable sector.  There's nothing inbetween.
  


Actually, there are a number of disk firmware and cache faults 
in between, which zfs has picked up over the years.



--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Stress test zfs

2012-01-05 Thread Andrew Gabriel




grant lowe wrote:
Ok. I blew it. I didn't add enough information. Here's
some more detail:
  
Disk array is a RAMSAN array, with RAID6 and 8K stripes. I'm measuring
performance with the results of the bonnie++ output and comparing with
the zpool iostat output. It's with the zpool iostat that I'm not
seeing a lot of writes.


Since ZFS never writes data back where it was, it can coalesce multiple
outstanding writes into fewer device writes. This may be what you're
seeing. I have a ZFS IOPs demo where the (multi-threaded) application
is performing over 10,000 synchronous write IOPs, but the underlying
devices are only performing about 1/10th of that, due to ZFS coalescing
multiple outstanding writes.

Sorry, I'm not familiar with what type of load bonnie generates.

-- 

Andrew Gabriel |
Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle EMEA Server Pre-Sales
ORACLE Corporation UK Ltd is a
company incorporated in England & Wales | Company Reg. No. 1782505
| Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA

Hardware and Software, Engineered
to Work Together



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Daugherity, Andrew W
Does "current" include sol10u10 as well as sol11?  If so, when did that go in?  
Was it in sol10u9?


Thanks,

Andrew

From: Cindy Swearingen 
cindy.swearin...@oracle.com
Subject: Re: [zfs-discuss] Can I create a mirror for a root rpool?
Date: December 16, 2011 10:38:21 AM CST
To: Tim Cook t...@cook.ms
Cc: zfs-discuss@opensolaris.org


Hi Tim,

No, in current Solaris releases the boot blocks are installed
automatically with a zpool attach operation on a root pool.

Thanks,

Cindy

On 12/15/11 17:13, Tim Cook wrote:
Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, Cindy Swearingen 
cindy.swearin...@oracle.com wrote:

   Hi Anon,

   The disk that you attach to the root pool will need an SMI label
   and a slice 0.

   The syntax to attach a disk to create a mirrored root pool
   is like this, for example:

   # zpool attach rpool c1t0d0s0 c1t1d0s0

   Thanks,

   Cindy

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-16 Thread Andrew Gabriel

On 12/16/11 07:27 AM, Gregg Wonderly wrote:
Cindy, will it ever be possible to just have attach mirror the 
surfaces, including the partition tables?  I spent an hour today 
trying to get a new mirror on my root pool.  There was a 250GB disk 
that failed.  I only had a 1.5TB handy as a replacement.  prtvtoc ... 
| fmthard does not work in this case


Can you be more specific about why it fails?
I have seen a couple of cases, and I'm wondering if you're hitting the 
same thing.

Can you post the prtvtoc output of your original disk please?

and so you have to do the partitioning by hand, which is just silly to 
fight with anyway.


Gregg


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow zfs send/recv speed

2011-11-15 Thread Andrew Gabriel

 On 11/15/11 23:05, Anatoly wrote:

Good day,

The speed of send/recv is around 30-60 MBytes/s for initial send and 
17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk 
to 100+ disks in pool. But the speed doesn't vary in any degree. As I 
understand 'zfs send' is a limiting factor. I did tests by sending to 
/dev/null. It worked out too slow and absolutely not scalable.
None of cpu/memory/disk activity were in peak load, so there is of 
room for improvement.


Is there any bug report or article that addresses this problem? Any 
workaround or solution?


I found these guys have the same result - around 7 Mbytes/s for 'send' 
and 70 Mbytes for 'recv'.

http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html


Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, 
the send runs at almost 100Mbytes/sec, so it's pretty much limited by 
the ethernet.


Since you have provided none of the diagnostic data you collected, it's 
difficult to guess what the limiting factor is for you.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow zfs send/recv speed

2011-11-15 Thread Andrew Gabriel

 On 11/15/11 23:40, Tim Cook wrote:
On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel 
andrew.gabr...@oracle.com wrote:


 On 11/15/11 23:05, Anatoly wrote:

Good day,

The speed of send/recv is around 30-60 MBytes/s for initial
send and 17-25 MBytes/s for incremental. I have seen lots of
setups with 1 disk to 100+ disks in pool. But the speed
doesn't vary in any degree. As I understand 'zfs send' is a
limiting factor. I did tests by sending to /dev/null. It
worked out too slow and absolutely not scalable.
None of cpu/memory/disk activity were in peak load, so there
is of room for improvement.

Is there any bug report or article that addresses this
problem? Any workaround or solution?

I found these guys have the same result - around 7 Mbytes/s
for 'send' and 70 Mbytes for 'recv'.
http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html


Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk
mirror, the send runs at almost 100Mbytes/sec, so it's pretty much
limited by the ethernet.

Since you have provided none of the diagnostic data you collected,
it's difficult to guess what the limiting factor is for you.

-- 
Andrew Gabriel




So all the bugs have been fixed?


Probably not, but the OP's implication that zfs send has a specific rate 
limit in the range suggested is demonstrably untrue. So I don't know 
what's limiting the OP's send rate. (I could guess a few possibilities, 
but that's pointless without the data.)


I seem to recall people on this mailing list using mbuff to speed it 
up because it was so bursty and slow at one point.  IE:

http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/



Yes, this idea originally came from me, having analyzed the send/receive 
traffic behavior in combination with network connection behavior. 
However, it's the receive side that's bursty around the TXG commits, not 
the send side, so that doesn't match the issue the OP is seeing. (The 
buffer sizes in that blog are not optimal, although any buffer at the 
receive side will make a significant improvement if the network 
bandwidth is same order of magnitude as the send/recv are capable of.)


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool scrub bad block list

2011-11-08 Thread Andrew Gabriel
ZFS detects far more errors that traditional filesystems will simply miss. 
This means that many of the possible causes for those errors will be 
something other than a real bad block on the disk. As Edward said, the disk 
firmware should automatically remap real bad blocks, so if ZFS did that 
too, we'd not use the remapped block, which is probably fine. For other 
errors, there's nothing wrong with the real block on the disk - it's going 
to be firmware, driver, cache corruption, or something else, so 
blacklisting the block will not solve the issue. Also, with some types of 
disk (SSD), block numbers are moved around to achieve wear leveling, so 
blacklisting a block number won't stop you reusing that real block.


--
Andrew Gabriel (from mobile)

--- Original message ---
From: Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com

To: didier.reb...@u-bourgogne.fr, zfs-discuss@opensolaris.org
Sent: 8.11.'11,  12:50


From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Didier Rebeix

 from ZFS documentation it appears unclear to me if a zpool
scrub will black list any found bad blocks so they won't be used
anymore.


If there are any physically bad blocks, such that the hardware (hard 
disk)
will return an error every time that block is used, then the disk should 
be

replaced.  All disks have a certain amount of error detection/correction
built in, and remap bad blocks internally and secretly behind the scenes,
transparent to the OS.  So if there are any blocks regularly reporting 
bad

to the OS, then it means there is a growing problem inside the disk.
Offline the disk and replace it.

It is ok to get an occasional cksum error.  Say, once a year.  Because 
the

occasional cksum error will be re-read and as long as the data is correct
the second time, no problem.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool replace not concluding + duplicate drive label

2011-10-27 Thread Andrew Freedman
On 28/10/2011, at 3:06 PM, Daniel Carosone wrote:

 On Thu, Oct 27, 2011 at 10:49:22AM +1100, afree...@mac.com wrote:
 Hi all,
 
 I'm seeing some puzzling behaviour with my RAID-Z.
 
 
 Indeed.  Start with zdb -l on each of the disks to look at the labels in more 
 detail.
 
 --
 Dan.

I'm reluctant to include a monstrous wall of text so I've placed the output at 
http://dl.dropbox.com/u/19420697/zdb.out.

Immediately I'm struck by the sad dearth of information on da6, the similarity 
of the da0 + da0/old subtree to the zpool status information and my total lack 
of knowledge on how to use this data in any beneficial fashion.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] solaris 10u8 hangs with message Disconnected command timeout for Target 0

2011-08-16 Thread Andrew Gabriel

Ding Honghui wrote:

Hi,

My solaris storage hangs. I login to the console and there are 
messages[1] displayed on the console.

I can't login into the console and seems the IO is totally blocked.

The system is solaris 10u8 on Dell R710 with disk array Dell MD3000. 2 
HBA cables connect the server and the MD3000.

The symptom is random.

It would be very much appreciated if anyone can help me out.


The SCSI target you are talking to is being reset. "Unit Attention" 
means it's forgotten what operating parameters have been negotiated with 
the system, and is a warning that the device might have been changed without 
the system knowing; it's telling you this happened because of a 
"device internal reset". That sort of thing can happen if the firmware 
in the SCSI target crashes and restarts, or the power supply blips, or 
if the device was swapped. I don't know anything about a Dell MD3000, 
but given it's happened on lots of disks at the same moment following a 
timeout, it looks like the array power cycled or array firmware (if any) 
rebooted. (Not sure if a SCSI bus reset can do this or not.)



[1]
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/pci@0,0/pci8086,3410@9/pci8086,32c@0/pci1028,1f04@8 (mpt1):

Aug 16 13:14:16 nas-hz-02   Disconnected command timeout for Target 0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/scsi_vhci/disk@g60026b900053aa1802a44b8f0ded (sd47):
Aug 16 13:14:16 nas-hz-02   Error for Command: 
write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 
1380679073Error Block: 1380679073
Aug 16 13:14:16 nas-hz-02 scsi: Vendor: 
DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal 
reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/scsi_vhci/disk@g60026b900053aa18029e4b8f0d61 (sd41):
Aug 16 13:14:16 nas-hz-02   Error for Command: 
write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 
1380679072Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi: Vendor: 
DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal 
reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/scsi_vhci/disk@g60026b900053aa1802a24b8f0dc5 (sd45):
Aug 16 13:14:16 nas-hz-02   Error for Command: 
write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 
1380679073Error Block: 1380679073
Aug 16 13:14:16 nas-hz-02 scsi: Vendor: 
DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal 
reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/scsi_vhci/disk@g60026b900053aa18029c4b8f0d35 (sd39):
Aug 16 13:14:16 nas-hz-02   Error for Command: 
write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 
1380679072Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi: Vendor: 
DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal 
reset), ASCQ: 0x4, FRU: 0x0
Aug 16 13:14:16 nas-hz-02 scsi: WARNING: 
/scsi_vhci/disk@g60026b900053aa1802984b8f0cd2 (sd35):
Aug 16 13:14:16 nas-hz-02   Error for Command: 
write(10)   Error Level: Retryable
Aug 16 13:14:16 nas-hz-02 scsi: Requested Block: 
1380679072Error Block: 1380679072
Aug 16 13:14:16 nas-hz-02 scsi: Vendor: 
DELL   Serial Number:
Aug 16 13:14:16 nas-hz-02 scsi: Sense Key: Unit Attention
Aug 16 13:14:16 nas-hz-02 scsi: ASC: 0x29 (device internal 
reset), ASCQ: 0x4, FRU: 0x0


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sudden drop in disk performance - WD20EURS 4k sectors to blame?

2011-08-15 Thread Andrew Gabriel

David Wragg wrote:

I've not done anything different this time from when I created the original 
(512b)  pool. How would I check ashift?
  


For a zpool called export...

# zdb export | grep ashift
ashift: 12
^C
#

As far as I know (although I don't have any WD's), all the current 4k 
sectorsize hard drives claim to be 512b sectorsize, so if you didn't do 
anything special, you'll probably have ashift=9.


I would look at a zpool iostat -v to see what the IOPS rate is (you may 
have bottomed out on that), and I would also work out average transfer 
size (although that alone doesn't necessarily tell you much - a dtrace 
quantize aggregation would be better). Also check service times on the 
disks (iostat) to see if there's one which is significantly worse and 
might be going bad.
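
For example (pool name is illustrative; let each run for a few intervals):

zpool iostat -v export 5   # per-vdev IOPS and bandwidth
iostat -xnz 5              # per-disk service times (asvc_t) and %b
# quantize the i/o sizes the disks are seeing (io provider), ^C to stop:
dtrace -n 'io:::start { @["i/o size"] = quantize(args[0]->b_bcount); }'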


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk IDs and DD

2011-08-10 Thread Andrew Gabriel

Lanky Doodle wrote:

Oh no I am not bothered at all about the target ID numbering. I just wondered 
if there was a problem in the way it was enumerating the disks.

Can you elaborate on the dd command LaoTsao? Is the 's' you refer to a 
parameter of the command or the slice of a disk - none of my 'data' disks have 
been 'configured' yet. I wanted to ID them before adding them to pools.
  


Use p0 on x86 (whole disk, without regard to any partitioning).
Any other s or p device node may or may not be there, depending on what 
partitions/slices are on the disk.
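
For example, to identify one drive at a time by its activity LED (device name
is illustrative):

dd if=/dev/rdsk/c7t2d0p0 of=/dev/null bs=1M count=10000
# watch which drive's LED stays lit, then repeat with the next device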


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-08 Thread Andrew Gabriel

Alexander Lesle wrote:

And what is your suggestion for scrubbing a mirror pool?
Once per month, every 2 weeks, every week.
  


There isn't just one answer.

For a pool with redundancy, you need to do a scrub just before the 
redundancy is lost, so you can be reasonably sure the remaining data is 
correct and can rebuild the redundancy.


The problem comes with knowing when this might happen. Of course, if you 
are doing some planned maintenance which will reduce the pool 
redundancy, then always do a scrub before that. However, in most cases, 
the redundancy is lost without prior warning, and you need to do 
periodic scrubs to cater for this case. I do a scrub via cron once a 
week on my home system. Having almost completely filled the pool, this 
was taking about 24 hours. However, now that I've replaced the disks and 
done a send/recv of the data across to a new larger pool which is only 
1/3rd full, that's dropped down to 2 hours.
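
For example, a weekly scrub from root's crontab (pool name and schedule are
just examples):

# run at 02:00 every Sunday
0 2 * * 0 /usr/sbin/zpool scrub export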


For a pool with no redundancy, where you rely only on backups for 
recovery, the scrub needs to be integrated into the backup cycle, such 
that you will discover corrupt data before it has crept too far through 
your backup cycle to be able to find a non corrupt version of the data.


When you have a new hardware setup, I would perform scrubs more 
frequently as a further check that the hardware doesn't have any 
systemic problems, until you have gained confidence in it.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] matching zpool versions to development builds

2011-08-08 Thread Andrew Gabriel

John Martin wrote:

Is there a list of zpool versions for development builds?

I found:

http://blogs.oracle.com/stw/entry/zfs_zpool_and_file_system

where it says Solaris 11 Express is zpool version 31, but my
system has BEs back to build 139 and I have not done a zpool upgrade
since installing this system but it reports on the current
development build:

# zpool upgrade -v
This system is currently running ZFS pool version 33.


It's painfully laid out (each on a separate page), but have a look at 
http://hub.opensolaris.org/bin/view/Community+Group+zfs/31 (change the 
version on the end of the URL). It conks out at version 31 though.


I have systems back to build 125, so I tend to always force zpool 
version 19 for that (and that automatically limits zfs version to 4).
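
For example, a sketch of creating a pool pinned at an older on-disk version:

zpool create -o version=19 portable c0t1d0
zpool upgrade -v    # lists what each pool version adds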


There's also some info about some builds on the zfs wikipedia page 
http://en.wikipedia.org/wiki/Zfs


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send/receive and ashift

2011-07-26 Thread Andrew Gabriel
Does anyone know if it's OK to do zfs send/receive between zpools with 
different ashift values?


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Adding mirrors to an existing zfs-pool

2011-07-26 Thread Andrew Gabriel

Bernd W. Hennig wrote:

G'Day,

- zfs pool with 4 disks (from Clariion A)
- must migrate to Clariion B (so I created 4 disks with the same size,
  avaiable for the zfs)

The zfs pool has no mirrors, my idea was to add the new 4 disks from
the Clariion B to the 4 disks which are still in the pool - and later
remove the original 4 disks.

I only found in all example how to create a new pool with mirrors
but no example how to add to a pool without mirrors a mirror disk
for each disk in the pool.

- is it possible to add disks to each disk in the pool (they have different
  sizes, so I have to add exactly the right disk from Clariion B to each
  original disk from Clariion A)

- can I later remove the disks from Clariion A, with the pool intact and users
  still able to work with the pool?
  


Depends on a few things...

What OS are you running, and what release/update or build?

What's the RAID layout of your pool (zpool status)?
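
If it turns out to be a simple stripe with no redundancy, the general shape of
the approach the poster describes would be (hypothetical device names; repeat
per disk, waiting for each resilver to finish before detaching):

zpool attach tank c1t0d0 c2t0d0   # mirror an old disk with its new counterpart
zpool status tank                 # wait until the resilver completes
zpool detach tank c1t0d0          # then drop the old Clariion A disk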

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Mount Options

2011-07-25 Thread Andrew Gabriel

Tony MacDoodle wrote:

I have a zfs pool called logs (about 200G).
I would like to create 2 volumes using this chunk of storage.
However, they would have different mount points.
ie. 50G would be mounted as /oracle/logs
100G would be mounted as /session/logs
 
is this possible?


Yes...

zfs create -o mountpoint=/oracle/logs logs/oracle
zfs create -o mountpoint=/session/logs logs/session

If you don't otherwise specify, the two filesystem will share the pool 
without any constraints.


If you wish to limit their max space...

zfs set quota=50g logs/oracle
zfs set quota=100g logs/session

and/or if you wish to reserve a minimum space...

zfs set reservation=50g logs/oracle
zfs set reservation=100g logs/session


Do I have to use the legacy mount options?


You don't have to.

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How create a FAT filesystem on a zvol?

2011-07-12 Thread Andrew Gabriel

Gary Mills wrote:

On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote:
  

On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills mi...@cc.umanitoba.ca wrote:


The `lofiadm' man page describes how to export a file as a block
device and then use `mkfs -F pcfs' to create a FAT filesystem on it.

Can't I do the same thing by first creating a zvol and then creating
a FAT filesystem on it?
  

seems not.


[...]
  

Some solaris tools (like fdisk, or mkfs -F pcfs) need disk geometry
to function properly. zvols don't provide that. If you want to use
zvols to work with such tools, the easiest way would be using lofi, or
exporting zvols as iscsi share and import it again.

For example, if you have a 10MB zvol and use lofi, fdisk would show
these geometry

 Total disk size is 34 cylinders
 Cylinder size is 602 (512 byte) blocks

... which will then be used if you run mkfs -F pcfs -o
nofdisk,size=20480. Without lofi, the same command would fail with

Drive geometry lookup (need tracks/cylinder and/or sectors/track:
Operation not supported



So, why can I do it with UFS?

# zfs create -V 10m rpool/vol1
# newfs /dev/zvol/rdsk/rpool/vol1
newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y
Warning: 4130 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/rpool/vol1:  20446 sectors in 4 cylinders of 48 tracks, 128 
sectors
10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32,

Why is this different from PCFS?
  


UFS has known for years that drive geometries are bogus, and just fakes 
something up to keep itself happy. What UFS thinks of as a cylinder 
bears no relation to actual disk cylinders.


If you give mkfs_pcfs all the geom data it needs, then it won't try 
asking the device...


andrew@opensolaris:~# zfs create -V 10m rpool/vol1
andrew@opensolaris:~# mkfs -F pcfs -o 
fat=16,nofdisk,nsect=255,ntrack=63,size=2 /dev/zvol/rdsk/rpool/vol1

Construct a new FAT file system on /dev/zvol/rdsk/rpool/vol1: (y/n)? y
andrew@opensolaris:~# fstyp /dev/zvol/rdsk/rpool/vol1
pcfs
andrew@opensolaris:~# fsck -F pcfs /dev/zvol/rdsk/rpool/vol1
** /dev/zvol/rdsk/rpool/vol1
** Scanning file system meta-data
** Correcting any meta-data discrepancies
10143232 bytes.
0 bytes in bad sectors.
0 bytes in 0 directories.
0 bytes in 0 files.
10143232 bytes free.
512 bytes per allocation unit.
19811 total allocation units.
19811 available allocation units.
andrew@opensolaris:~# mount -F pcfs /dev/zvol/dsk/rpool/vol1 /mnt
andrew@opensolaris:~#

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 512b vs 4K sectors

2011-07-04 Thread Andrew Gabriel

Richard Elling wrote:

On Jul 4, 2011, at 6:42 AM, Lanky Doodle wrote:

  

Hiya,

I've been doing a lot of research surrounding this and ZFS, including some 
posts on here, though I am still left scratching my head.

I am planning on using slow RPM drives for a home media server, and it's these 
that seem to 'suffer' from a few problems:

Seagate Barracuda LP - Looks to be the only true 512b sector hard disk. Serious 
firmware issues
Western Digital Caviar Green - 4K sectors = crap write performance
Hitachi 5K3000 - Variable sector sizing (according to tech. specs)
Samsung SpinPoint F4 - Just plain old problems with them

What is the best drive of the above 4, and are 4K drives really a no-no with 
ZFS? Are there any alternatives in the same price bracket?



4K drives are fine, especially if the workload is read-mostly.

Depending on the OS, you can tell ZFS to ignore the incorrect physical sector 
size reported by some drives. Today, this is easiest in FreeBSD, a little bit more
tricky in OpenIndiana (patches and source are available for a few different 
implementations). Or you can just trick them out by starting the pool with a 4K

sector device that doesn't lie (eg, iscsi target).

  

Who would have thought choosing a hard disk could be so 'hard'!



I recommend enterprise-grade disks, none of which made your short list ;-(.
 -- richard


I'm going through this at the moment. I've bought a pair of Seagate 
Barracuda XT 2Tb disks (which are a bit more Enterprise than the list 
above), just plugged them in, and so far they're OK. Not had them long 
enough to report on longevity.


--

Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 700GB gone?

2011-06-30 Thread Andrew Gabriel

 On 06/30/11 08:50 PM, Orvar Korvar wrote:

I have a 1.5TB disk that has several partitions. One of them is 900GB. Now I 
can only see 300GB. Where is the rest? Is there a command I can do to reach the 
rest of the data? Will scrub help?


Not much to go on - no one can answer this.

How did you go about partitioning the disk?
What does the fdisk partitioning look like (if it's x86)?
What does the VToC slice layout look like?
What are you using each partition and slice for?
What tells you that you can only see 300GB?

--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-28 Thread Andrew Gabriel

 On 06/27/11 11:32 PM, Bill Sommerfeld wrote:

On 06/27/11 15:24, David Magda wrote:

Given the amount of transistors that are available nowadays I think
it'd be simpler to just create a series of SIMD instructions right
in/on general CPUs, and skip the whole co-processor angle.

see: http://en.wikipedia.org/wiki/AES_instruction_set

Present in many current Intel CPUs; also expected to be present in AMD's
Bulldozer based CPUs.


I recall seeing a blog comparing the existing Solaris hand-tuned AES 
assembler performance with the (then) new AES instruction version, where 
the Intel AES instructions only got you about a 30% performance 
increase. I've seen reports of better performance improvements, but 
usually by comparing with the performance on older processors which are 
going to be slower for additional reasons than just missing the AES 
instructions. Also, you could claim better performance improvement if 
you compared against a less efficient original implementation of AES. 
What this means is that a faster CPU may buy you more crypto performance 
than the AES instructions alone will do.


My understanding from reading the Intel AES instruction set (which I 
warn might not be completely correct) is that the AES 
encryption/decryption instruction is executed between 10 and 14 times 
(depending on key length) for each 128 bits (16 bytes) of data being 
encrypted/decrypted, so it's very much part of the regular instruction 
pipeline. The code will have to loop through this process multiple times 
to process a data block bigger than 16 bytes, i.e. a double nested loop, 
although I expect it's normally loop-unrolled a fair degree for 
optimisation purposes.


Conversely, the crypto units in the T-series processors are separate 
from the CPU, and do the encryption/decryption whilst the CPU is getting 
on with something else, and they do it much faster than it could be done 
on the CPU. Small blocks are normally a problem for crypto offload 
engines because the overhead of farming off the work to the engine and 
getting the result back often means that you can do the crypto on the 
CPU faster than the time it takes to get the crypto engine started and 
stopped. However, T-series crypto is particularly good at handling small 
blocks efficiently, such as around 1kbyte which you are likely to find 
in a network packet, as it is much closer coupled to the CPU than a PCI 
crypto card can be, and performance with small packets was key for the 
crypto networking support T-series was designed for. Of course, it 
handles crypto of large blocks just fine too.


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Andrew Gabriel

Richard Elling wrote:

On Jun 19, 2011, at 6:04 AM, Andrew Gabriel wrote:
  

Richard Elling wrote:


Actually, all of the data I've gathered recently shows that the number of IOPS 
does not significantly increase for HDDs running random workloads. However the 
response time does :-( My data is leading me to want to restrict the queue 
depth to 1 or 2 for HDDs.
 
  

Thinking out loud here, but if you can queue up enough random I/Os, the 
embedded disk controller can probably do a good job reordering them into a less 
random elevator sweep pattern, and increase IOPS by reducing the total 
seek time, which may be why IOPS does not drop as much as one might imagine if 
you think of the heads doing random seeks (they aren't random anymore). 
However, this requires that there's a reasonable queue of I/Os for the 
controller to optimise, and processing that queue will necessarily increase the 
average response time. If you run with a queue depth of 1 or 2, the controller 
can't do this.



I agree. And disksort is in the mix, too.
  


Oh, I'd never looked at that.


This is something I played with ~30 years ago, when the OS disk driver was 
responsible for the queuing and reordering disc transfers to reduce total seek 
time, and disk controllers were dumb.



...and disksort still survives... maybe we should kill it?
  


It looks like it's possibly slightly worse than the pathologically worst 
response time case I described below...



There are lots of options and compromises, generally weighing reduction in 
total seek time against longest response time. Best reduction in total seek 
time comes from planning out your elevator sweep, and inserting newly queued 
requests into the right position in the sweep ahead. That also gives the 
potentially worst response time, as you may have one transfer queued for the 
far end of the disk, whilst you keep getting new transfers queued for the track 
just in front of you, and you might end up reading or writing the whole disk 
before you get to do that transfer which is queued for the far end. If you can 
get a big enough queue, you can modify the insertion algorithm to never insert 
into the current sweep, so you are effectively planning two sweeps ahead. Then 
the worse response time becomes the time to process one queue full, rather than 
the time to read or write the whole disk. Lots of other tricks too (e.g. 
insertion into sweeps taking into account priority, such as if

 the I/O is a synchronous or asynchronous, and age of existing queue entries). 
I had much fun playing with this at the time.



The other wrinkle for ZFS is that the priority scheduler can't re-order I/Os 
sent to the disk.
  


Does that also go through disksort? Disksort doesn't seem to have any 
concept of priorities (but I haven't looked in detail where it plugs in 
to the whole framework).



So it might make better sense for ZFS to keep the disk queue depth small for 
HDDs.
 -- richard
  


--
Andrew Gabriel

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-19 Thread Andrew Gabriel

Richard Elling wrote:
Actually, all of the data I've gathered recently shows that the number of 
IOPS does not significantly increase for HDDs running random workloads. 
However the response time does :-( My data is leading me to want to restrict 
the queue depth to 1 or 2 for HDDs.
  


Thinking out loud here, but if you can queue up enough random I/Os, the 
embedded disk controller can probably do a good job reordering them into 
a less random elevator sweep pattern, and increase IOPS by reducing 
the total seek time, which may be why IOPS does not drop as much as one 
might imagine if you think of the heads doing random seeks (they aren't 
random anymore). However, this requires that there's a reasonable queue 
of I/Os for the controller to optimise, and processing that queue will 
necessarily increase the average response time. If you run with a queue 
depth of 1 or 2, the controller can't do this.


This is something I played with ~30 years ago, when the OS disk driver 
was responsible for the queuing and reordering disc transfers to reduce 
total seek time, and disk controllers were dumb. There are lots of 
options and compromises, generally weighing reduction in total seek time 
against longest response time. Best reduction in total seek time comes 
from planning out your elevator sweep, and inserting newly queued 
requests into the right position in the sweep ahead. That also gives the 
potentially worst response time, as you may have one transfer queued for 
the far end of the disk, whilst you keep getting new transfers queued 
for the track just in front of you, and you might end up reading or 
writing the whole disk before you get to do that transfer which is 
queued for the far end. If you can get a big enough queue, you can 
modify the insertion algorithm to never insert into the current sweep, 
so you are effectively planning two sweeps ahead. Then the worst 
response time becomes the time to process one queue full, rather than 
the time to read or write the whole disk. Lots of other tricks too (e.g. 
insertion into sweeps taking into account priority, such as if the I/O 
is a synchronous or asynchronous, and age of existing queue entries). I 
had much fun playing with this at the time.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compare snapshot to current zfs fs

2011-06-04 Thread Andrew Gabriel

Harry Putnam wrote:

I have a sneaking feeling I'm missing something really obvious.

If you have zfs fs that see little use and have lost track of whether
changes may have occurred since last snapshot, is there some handy way
to determine if a snapshot matches its filesystem.  Or put another
way, some way to determine if the snapshot is different than its
current filesystem.

I know about the diff tools and of course I guess one could compare
overall sizes in bytes for a good idea, but is there a way provided by
zfs?
  


If you have a recent enough OS release...

zfs diff snapshot [snapshot | filesystem]
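
For example (pool/filesystem names are just placeholders):

zfs diff tank/home@yesterday tank/home

which lists what has changed since the snapshot: + added, - removed, M modified, R renamed. No output means nothing has changed.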

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely slow zpool scrub performance

2011-05-14 Thread Andrew Gabriel

 On 05/14/11 01:08 PM, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Donald Stahl

Running a zpool scrub on our production pool is showing a scrub rate
of about 400K/s. (When this pool was first set up we saw rates in the
MB/s range during a scrub).

Wait longer, and keep watching it.  Or just wait till it's done and look at
the total time required.  It is normal to have periods of high and low
during scrub.  I don't know why.


Check the IOPS per drive - you may be maxing out on one of them if it's 
in an area where there are lots of small blocks.


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Andrew Gabriel

Toby Thain wrote:

On 08/05/11 10:31 AM, Edward Ned Harvey wrote:
  

...
Incidentally, do fsync() and sync return instantly or wait?  Cuz time
sync might produce 0 sec every time even if there were something waiting to
be flushed to disk.



The semantics need to be synchronous. Anything else would be a horrible bug.
  


sync(2) is not required to be synchronous.
I believe that for ZFS it is synchronous, but for most other 
filesystems, it isn't (although a second sync will block until the 
actions resulting from a previous sync have completed).


fsync(3C) is synchronous.

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-03 Thread Andrew Gabriel

Dan Shelton wrote:
Is anyone aware of any freeware program that can speed up copying tons 
of data (2 TB) from UFS to ZFS on same server?


I use 'ufsdump | ufsrestore'*. I would also suggest try setting 
'sync=disabled' during the operation, and reverting it afterwards. 
Certainly, fastfs (a similar although more dangerous option for ufs) 
makes ufs to ufs copying significantly faster.


*ufsrestore works fine on ZFS filesystems (although I haven't tried it 
with any POSIX ACLs on the original ufs filesystem, which would probably 
simply get lost).
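
A rough sketch of the whole thing, with a made-up source slice and target filesystem:

zfs set sync=disabled tank/data
ufsdump 0f - /dev/rdsk/c0t0d0s6 | (cd /tank/data && ufsrestore rf -)
zfs set sync=standard tank/data      # revert as soon as the copy is done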


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] No write coalescing after upgrade to Solaris 11 Express

2011-04-27 Thread Andrew Gabriel

Matthew Anderson wrote:

Hi All,

I've run into a massive performance problem after upgrading to Solaris 11 
Express from oSol 134.

Previously the server was performing a batch write every 10-15 seconds and the 
client servers (connected via NFS and iSCSI) had very low wait times. Now I'm 
seeing constant writes to the array with a very low throughput and high wait 
times on the client servers. Zil is currently disabled.


How/Why?


 There is currently one failed disk that is being replaced shortly.

Is there any ZFS tunable to revert Solaris 11 back to the behaviour of oSol 134?
  


What does zfs get sync report?
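
i.e. (pool name is just an example):

zfs get -r sync tank

standard is the default; if something has set it to always or disabled, that could account for a change in write behaviour.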

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot remove zil device

2011-04-01 Thread Daugherity, Andrew W
Do you have any details on that CR?  Either my Google-fu is failing or Oracle 
has made the CR database private. I haven't encountered this problem but would 
like to know if there are certain behaviors to avoid to not risk this.

Has it been fixed in Sol10 or OpenSolaris?


Thanks,

Andrew


From: Cindy Swearingen [cindy.swearin...@oracle.com]
Sent: Thursday, March 31, 2011 1:55 PM
To: Roy Sigurd Karlsbakk
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Cannot remove zil device

You can add and remove mirrored or non-mirrored log devices.

Jordan is probably running into CR 7000154:

cannot remove log device

Thanks,

Cindy

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SIL3114 and sparc solaris 10

2011-02-23 Thread Andrew Gabriel

Mauricio Tavares wrote:
Perhaps a bit off-topic (I asked on the rescue list -- 
http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and was 
told to try here), but I am kinda shooting in the dark: I have been 
finding online scattered and vague info stating that this card can be 
made to work with a sparc solaris 10 box 
(http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html 
is the only link I can offer right now). Can anyone confirm or deny that? 


3112/3114 was a very early (possibly the first?) SATA chipset, I think 
aimed for use before SATA drivers had been developed. I would suggest 
looking for something more modern.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-05 Thread Andrew Gabriel

Roy Sigurd Karlsbakk wrote:

Nope. Most HDDs today have a single read channel, and they select
which head uses that channel at any point in time. They cannot use
multiple heads at the same time, because the heads do not travel the
same path on their respective surfaces at the same time. There's no
real vertical alignment of the tracks between surfaces, and every
surface has its own embedded position information that is used when
that surface's head is active. There were attempts at multi-actuator
designs with separate servo arms and multiple channels, but
mechanically they're too difficult to manufacture at high yields as I
understood it.



Perhaps a stupid question, but why don't they read from all platters in 
parallel?
  


The answer is in the text you quoted above.

There are drives now with two level actuators.
The primary actuator is the standard actuator you are familiar with 
which moves all the arms.
The secondary actuator is a piezo crystal towards the head end of the 
arm which can move the head a few tracks very quickly without having to 
move the arm, and these are one per head. In theory, this might allow 
multiple heads to lock on to their respective tracks at the same time 
for parallel reads, but I haven't heard that they are used in this way.


If you go back to the late 1970's before tracks had embedded servo data, 
on multi-platter disks you had one surface which contained the head 
positioning servo data, and the drive relied on accurate vertical 
alignment between heads/surfaces to keep on track (and drives could 
head-switch instantly). Around 1980, tracks got too close together for 
this to work anymore, and the servo positioning data was embedded into 
each track itself. The very first drives of this type scanned all the 
surfaces on startup to build up an internal table of the relative 
misalignment of tracks across the surfaces, but this rapidly became 
unviable as drive capacity increased and this scan would take an 
unreasonable length of time. It may be that modern drives learn this as 
they go - I don't know.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and TRIM

2011-01-29 Thread Andrew Gabriel

Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

My google-fu is coming up short on this one...  I didn't see that it had


been
  

discussed in a while ...



BTW, there were a bunch of places where people said ZFS doesn't need trim.
Which I hope, by now, has been commonly acknowledged as bunk.

The only situation where you don't need TRIM on a SSD is when (a) you're
going to fill it once and never write to it again, which is highly unlikely
considering the fact that you're buying a device for its fast write
performance... (b) you don't care about performance, which is highly
unlikely considering the fact that you bought a performance device ...  (c)
you are using whole disk encryption.  This is a valid point.  You would
probably never TRIM anything from a fully encrypted disk ... 


In places where people said TRIM was thought to be unnecessary, the
justification they stated was that TRIM will only benefit people whose usage
patterns are sporadic, rather than sustained.  The downfall of that argument
is the assumption that the device can't perform TRIM operations
simultaneously while performing other operations.  That may be true in some
cases, or even globally, but without backing, it's just an assumption.  One
which I find highly questionable.


TRIM could also be useful where ZFS uses a storage LUN which is sparsely 
provisioned, in order to deallocate blocks in the LUN which have 
previously been allocated, but whose contents have since been invalidated.


In this case, both ZFS and whatever is providing the storage LUN would 
need to support TRIM.


Out of interest, what other filesystems out there today can generate 
TRIM commands?


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] show hdd life time ?

2011-01-23 Thread Andrew Gabriel

Richard Elling wrote:

On Jan 21, 2011, at 7:36 PM, Tobias Lauridsen wrote:

  

Is it possible to see my HDD's total time in use, so I can switch to
a new one before it gets too many hours old?



In theory, yes. In practice, I've never seen a disk properly report this data 
on a
consistent basis :-(  Perhaps some of the more modern disks do a better job?

Look for the power on hours (POH) attribute of SMART.
http://en.wikipedia.org/wiki/S.M.A.R.T.
  


If you're looking for stats to give an indication of likely wear, and
thus increasing probability of failure, POH is probably not very useful by
itself (or even at all). Things like Head Flying Hours and Load Cycle
Count are probably more indicative, although not necessarily maintained
by all drives.

Of course, data which gives indication of actual (rather than likely)
wear is even more important as an indicator of impending failure, such
as the various error and retry counts.

--
Andrew Gabriel

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] incorrect vdev added to pool

2011-01-18 Thread Andrew Gabriel

 On 01/15/11 11:32 PM, Gal Buki wrote:

Hi

I have a pool with a raidz2 vdev.
Today I accidentally added a single drive to the pool.

I now have a pool that partially has no redundancy as this vdev is a single 
drive.

Is there a way to remove the vdev


Not at the moment, as far as I know.


and replace it with a new raidz2 vdev?
If not what can I do to do damage control and add some redundancy to the single 
drive vdev?


I think you should be able to attach another disk to it to make them 
into a mirror. (Make sure you attach, and not add.)


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?

2010-11-15 Thread Andrew Gabriel

sridhar surampudi wrote:

Hi Darren,

In short, I am looking for a way to freeze and thaw a ZFS file system so that for a 
hardware snapshot I can do:
1. run zfs freeze
2. run a hardware snapshot on the devices belonging to the zpool where the given file system resides
3. run zfs thaw

Unlike other filesystems, ZFS is always consistent on disk, so there's 
no need to freeze a zpool to take a hardware snapshot. The hardware 
snapshot will effectively contain all transactions up to the last 
transaction group commit, plus all synchronous transactions up to the 
hardware snapshot. If you want to be sure that all transactions up to a 
certain point in time are included (for the sake of an application's 
data), take a ZFS snapshot (which will force a TXG commit), and then 
take the hardware snapshot. You will not be able to access the hardware 
snapshot from the system which has the original zpool mounted, because 
the two zpools will have the same pool GUID (there's an RFE outstanding 
on fixing this).
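
i.e. something like (pool and snapshot names made up):

zfs snapshot -r tank@pre-hwsnap

and trigger the array snapshot once that returns.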


The one thing you do need to be careful of is, that with a multi-disk 
zpool, the hardware snapshot is taken at an identical point in time 
across all the disks in the zpool. This functionality is usually an 
extra-charge option in Enterprise storage systems. If the hardware 
snapshots are staggered across multiple disks, all bets are off, 
although if you take a zfs snapshot immediately beforehand and you test 
import/scrub the hardware snapshot (on a different system) immediately 
(so you can repeat the hardware snapshot again if it fails), maybe you 
will be lucky.


The right way to do this with zfs is to send/recv the datasets to a 
fresh zpool, or (S10 Update 9) to create an extra zpool mirror and then 
split it off with zpool split.
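
A rough sketch of the split approach, with made-up device names (and it only works when the pool is made of mirrors):

zpool attach tank c0t1d0 c0t9d0    # grow the mirror by one side, wait for the resilver
zpool split tank tanksnap          # detach the new side as a separate pool
zpool import tanksnap              # on this or another host

I believe zpool split detaches the last device of each mirror by default, so check zpool status first.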


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing GUID

2010-11-15 Thread Andrew Gabriel

sridhar surampudi wrote:

Hi, I am looking along similar lines.

my requirement is 


1. create a zpool on one or many devices ( LUNs ) from an array ( array can be 
IBM or HPEVA or EMC etc.. not SS7000).
2. Create file systems on zpool
3. Once file systems are in use (I/O is happening) I need to take a snapshot at the 
array level
 a. Freeze the zfs file system ( not required due to zfs consistency : source : 
mailing groups)
 b. take array snapshot ( say .. IBM flash copy )
 c. Got new snapshot device (having same data and metadata including same GUID 
of source pool)

  Now I need a way to change the GUID and pool name of the snapshot device so that the 
snapshot device can be accessed on the same host or an alternate host (if the LUN 
is shared).

Could you please post commands for the same.


There is no way I know of currently. (There was an unofficial program 
floating around to do this on much earlier opensolaris versions, but it 
no longer works).


If you have a support contract, raise a call and asked to be added to 
RFE 6744320.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to quiesce and unquiesc zfs and zpool for array/hardware snapshots ?

2010-11-15 Thread Andrew Gabriel

 Sridhar,

You have switched to a new disruptive filesystem technology, and it has 
to be disruptive in order to break out of all the issues older 
filesystems have, and give you all the new and wonderful features. 
However, you are still trying to use old filesystem techniques with it, 
which is why things don't fit for you, and you are missing out on the 
more powerful way ZFS presents these features to you.


On 11/16/10 06:59 AM, Ian Collins wrote:

On 11/16/10 07:19 PM, sridhar surampudi wrote:

Hi,

How would it help for instant recovery or point-in-time recovery,
i.e. restoring data at the device/LUN level?


Why would you want to?  If you are sending snapshots to another pool, 
you can do instant recovery at the pool level.


Point in time recovery is a feature of ZFS snapshots. What's more, with 
ZFS you can see all your snapshots online all the time, read and/or 
recover just individual files or whole datasets, and the storage 
overhead is very efficient.
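
For example (dataset, snapshot and file names made up):

ls /tank/home/.zfs/snapshot/monday/
cp /tank/home/.zfs/snapshot/monday/report.odt /tank/home/    # recover one file
zfs rollback tank/home@monday      # rewind the whole dataset (if that's the most recent snapshot)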


If you want to recover a whole LUN, that's presumably because you lost 
the original, and in this case the system won't have the original 
filesystem mounted.




Currently it is easy, as I can unwind the primary device stack, 
restore data at the device/LUN level, and recreate the stack.


It's probably easier with ZFS to restore data at the pool or 
filesystem level from snapshots.


Trying to work at the device level is just adding an extra level of 
complexity to a problem already solved.




I won't claim ZFS couldn't better support use of back-end Enterprise 
storage, but in this case, you haven't given any use cases where that's 
relevant.


--
Andrew
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do you use 1 partition on x86?

2010-10-25 Thread Andrew Gabriel
Change the new partition type to something that none of the OS's on the 
system will know anything about, so they don't make any invalid 
assumptions about what might be in it. Then use the appropriate 
partition device node, /dev/dsk/c7t0d0p4 (assuming it's the 4th primary 
FDISK partition).
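
i.e. roughly (pool name is up to you):

zpool create newpool c7t0d0p4    # p4 = the 4th primary FDISK partition as above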


Multiple zpools on one disk is not going to be good for performance if 
you use both together. There may be some way to grow the existing 
Solaris partition into the spare space without destroying the contents 
and then growing the zpool into the new space, but I haven't tried this 
with FDISK partitions, so I don't know if it works without damaging the 
existing contents. (I have done it with slices, and it does work in that 
case.)


Bill Werner wrote:

So when I built my new workstation last year, I partitioned the one and only 
disk in half, 50% for Windows, 50% for 2009.06.   Now, I'm not using Windows, 
so I'd like to use the other half for another ZFS pool, but I can't figure out 
how to access it.

I have used fdisk to create a second Solaris2 partition, did a re-con reboot, 
but format still only shows the 1 available partition.  How do I used the 
second partition?

selecting c7t0d0
 Total disk size is 30401 cylinders
 Cylinder size is 16065 (512 byte) blocks

   Cylinders
  Partition   StatusType  Start   End   Length%
  =   ==  =   ===   ==   ===
  1 Other OS  0 4   5  0
  2 IFS: NTFS 5  19171913  6
  3   ActiveSolaris2   1917  1497113055 43
  4 Solaris2   14971  3017015200 50

 format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c7t0d0 <DEFAULT cyl 13052 alt 2 hd 255 sec 63>
  /p...@0,0/pci1028,2...@1f,2/d...@0,0


Thanks for any idea.
  



--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread Andrew Gabriel

casper@sun.com wrote:

On Tue, Oct 5, 2010 at 11:49 PM,  casper@sun.com wrote:


I'm not sure that that is correct; the drive works on naive clients but I
believe it can reveal its true colors.
  

The drive reports 512 byte sectors to all hosts. AFAIK there's no way
to make it report 4k sectors.




Too bad because it makes it less useful (specifically because the label 
mentions sectors and if you can use bigger sectors, you can address a 
larger drive).
  


Having now read a number of forums about these, there's a strong feeling 
WD screwed up by not providing a switch to disable pseudo 512b access so 
you can use the 4k native. The industry as a whole will transition to 4k 
sectorsize over the next few years, but these first 4k sectorsize HDs are 
rather less useful with 4k sectorsize-aware OS's. Let's hope other 
manufacturers get this right in their first 4k products.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Andrew Gabriel

Michael DeMan wrote:
The WD 1TB 'enterprise' drives are still 512 sector size and safe to 
use, who knows though, maybe they just started shipping with 4K sector 
size as I write this e-mail?


Another annoying thing with the whole 4K sector size, is what happens 
when you need to replace drives next year, or the year after?  That 
part has me worried on this whole 4K sector migration thing more than 
what to buy today.  Given the choice, I would prefer to buy 4K sector 
size now, but operating system support is still limited.  Does anybody 
know if there any vendors that are shipping 4K sector drives that have 
a jumper option to make them 512 size?  WD has a jumper, but is there 
explicitly to work with WindowsXP, and is not a real way to dumb down 
the drive to 512.  I would presume that any vendor that is shipping 4K 
sector size drives now, with a jumper to make it 'real' 512, would be 
supporting that over the long run?


Changing the sector size (if it's possible at all) would require a
reformat of the drive.

On SCSI disks which support it, you do it by changing the sector size on
the relevant mode select page, and then sending a format-unit command to
make the drive relayout all the sectors.

I've no idea if these 4K sata drives have any such mechanism, but I
would expect they would.

BTW, I've been using a pair of 1TB Hitachi Ultrastar for something like
18 months without any problems at all. Of course, a 1 year old disk
model is no longer available now. I'm going to have to swap out for
bigger disks in the not too distant future.

--
Andrew Gabriel

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fs root inode number?

2010-09-26 Thread Andrew Gabriel




Richard L. Hamilton wrote:

  Typically on most filesystems, the inode number of the root
directory of the filesystem is 2, 0 being unused and 1 historically
once invisible and used for bad blocks (no longer done, but kept
reserved so as not to invalidate assumptions implicit in ufsdump tapes).

However, my observation seems to be (at least back at snv_97), the
inode number of ZFS filesystem root directories (including at the
top level of a spool) is 3, not 2.

If there's any POSIX/SUS requirement for the traditional number 2,
I haven't found it.  So maybe there's no reason founded in official
standards for keeping it the same.  But there are bound to be programs
that make what was with other filesystems a safe assumption.

Perhaps a warning is in order, if there isn't already one.

Is there some _reason_ why the inode number of filesystem root directories
in ZFS is 3 rather than 2?
  


If you look at zfs_create_fs(), you will see the first 3 items created
are:

Create zap object used for SA attribute registration
Create a delete queue.
Create root znode.


Hence, inode 3.

-- 

Andrew Gabriel




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Adding a higher level partition to ZFS pool

2010-09-24 Thread Andrew Gabriel

Axelle Apvrille wrote:

Hi all,
I would like to add a new partition to my ZFS pool but it looks like it's more 
tricky than expected.

The layout of my disk is the following:
- first partition for Windows. I want to keep it. (no formatting !)
- second partition for OpenSolaris. This is where I have all the Solaris slices 
(c0d0s0 etc). I have a single ZFS pool. OpenSolaris boots on ZFS.
- third partition: a FAT partition I want to keep (no formatting !)
- fourth partition: I want to add this partition to my ZFS pool (or another 
pool ?). I don't care if information on that partition is lost, I can format it 
if necessary.
zpool add c0d0p0:2 ? Hmm...
  


You cannot add it to the root pool, as the root pool cannot be a RAID0.

You can make another pool from it...

Ideally, set the FDISK partition type to something that none of the OS's 
on the system will know anything about. (It doesn't matter what it is 
from the zfs point of view, but you don't want any of the OS's thinking 
it's something they believe they know how to use.)


zpool create tank c0d0p4 (for the 4th FDISK primary partition).

Note that two zpools on the same disk may give you poor performance if 
you are accessing both at the same time, as you are forcing head seeking 
between them.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Failed zfs send invalid backup stream.............

2010-09-12 Thread Andrew Gabriel

Humberto Ramirez wrote:

I'm trying to replicate a 300 GB pool with this command

zfs send al...@3 | zfs receive -F omega

about 2 hours in to the process it fails with this error

cannot receive new filesystem stream: invalid backup stream

I have tried setting the target read only  (zfs set readonly=on omega)
and also disabled Time Slider, thinking it might have something to do with it.

What could be causing the error ?


Could be zfs filesystem version too old on the sending side (I have one 
such case).

What are their versions, and what release/build of the OS are you using?

if the target is a new hard drive can I use 
this:   zfs send al...@3 > /dev/c10t0d0 ?
  


That command doesn't make much sense for the purpose of doing anything 
useful.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...

2010-08-24 Thread Andrew Gabriel
.



--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris startup script location

2010-08-18 Thread Andrew Gabriel

Alxen4 wrote:

Is there any way to run a start-up script before a non-root pool is mounted?

For example, I'm trying to use a ramdisk as a ZIL device (ramdiskadm).
So I need to create the ramdisk before the actual pool is mounted, otherwise it 
complains that the log device is missing :)

For sure I can manually remove and add it by script and put the script in the 
regular rc2.d location... I'm just looking for a more elegant way to do it.
  


Can you start by explaining what you're trying to do, because this may 
be completely misguided?


A ramdisk is volatile, so you'll lose it when the system goes down, causing 
failure to mount on reboot. Recreating a ramdisk on reboot won't 
recreate the slog device you lost when the system went down. I expect 
the zpool would fail to mount.


Furthermore, using a ramdisk as a ZIL is effectively just a very 
inefficient way to disable the ZIL.
A better way to do this is to zfs set sync=disabled ... on relevant 
filesystems.
I can't recall which build introduced this, but prior to that, you can 
set zfs://zil_disable=1 in /etc/system but that applies to all 
pools/filesystems.
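
i.e. rather than faking up a volatile slog, roughly (dataset name made up):

zfs set sync=disabled tank/nfsdata     # and sync=standard puts it back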


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris startup script location

2010-08-18 Thread Andrew Gabriel

Andrew Gabriel wrote:

Alxen4 wrote:

Is there any way run start-up script before non-root pool is mounted ?

For example I'm trying to use ramdisk as ZIL device (ramdiskadm )
So I need to create ramdisk before actual pool is mounted otherwise 
it complains that log device is missing :)


For sure I can manually remove/and add it by script and put the 
script in regular rc2.d location...I'm just looking for more elegant 
way to it.


Can you start by explaining what you're trying to do, because this may 
be completely misguided?


A ramdisk is volatile, so you'll lose it when system goes down, 
causing failure to mount on reboot. Recreating a ramdisk on reboot 
won't recreate the slog device you lost when the system went down. I 
expect the zpool would fail to mount.


Furthermore, using a ramdisk as a ZIL is effectively just a very 
inefficient way to disable the ZIL.
A better way to do this is to zfs set sync=disabled ... on relevant 
filesystems.
I can't recall which build introduced this, but prior to that, you can 
set zfs://zil_disable=1 in /etc/system but that applies to all 
pools/filesystems.




The double-slash was brought to you by a bug in thunderbird. The 
original read: set zfs:zil_disable=1


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris startup script location

2010-08-18 Thread Andrew Gabriel

Alxen4 wrote:

Thanks...Now I think I understand...

Let me summarize it and let me know if I'm wrong.

Disabling the ZIL converts all synchronous calls to asynchronous ones, which makes ZFS 
report data acknowledgment before it has actually been written to stable storage, 
which in turn improves performance but might cause data corruption in case of a 
server crash.

Is it correct ?

In my case I'm having serious performance issues with NFS over ZFS.
  


You need a non-volatile slog, such as an SSD.


My NFS client is ESXi, so the major question is: is there a risk of corruption for 
VMware images if I disable the ZIL?
  


Yes.

If your NFS server takes an unexpected outage and comes back up again, 
some writes will have been lost which ESXi thinks succeeded (typically 5 
to 30 seconds worth of writes/updates immediately before the outage). So 
as an example, if you had an application writing a file sequentially, 
you will likely find an area of the file is corrupt because the data was 
lost.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris startup script location

2010-08-18 Thread andrew . gabriel
What you say is true only on the system itself. On an NFS client system, 30 
seconds of lost data in the middle of a file (as per my earlier example) is a 
corrupt file.

-original message-
Subject: Re: [zfs-discuss] Solaris startup script location
From: Edward Ned Harvey sh...@nedharvey.com
Date: 18/08/2010 17:17

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Alxen4
 
 Disabling ZIL converts all synchronous calls to asynchronous which
 makes ZSF to report data acknowledgment before it actually was written
 to stable storage which in turn improves performance but might cause
 data corruption in case of server crash.
 
 Is it correct ?

It is partially correct.

With the ZIL disabled, you could lose up to 30 sec of writes, but it won't
cause an inconsistent filesystem, or corrupt data.  If you make a
distinction between corrupt and lost data, then this is valuable for you
to know:

Disabling the ZIL can result in up to 30sec of lost data, but not corrupt
data.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS automatic rollback and data rescue.

2010-08-14 Thread Andrew Gabriel

Constantine wrote:

Hi.

I've got a ZFS filesystem (OpenSolaris 2009.06) which, as far as I can see, was 
automatically rolled back by the OS to the latest snapshot after a power failure.


ZFS doesn't do this.
Can you give some more details of what you're seeing?
Would also be useful to see output of: zfs list -t all -r zpool/filesystem


 There is a problem - the snapshot is too old, and consequently there is a 
question -- can I browse the pre-rollback (corrupted) branch of the FS? And if I 
can, how?
  

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS automatic rollback and data rescue.

2010-08-14 Thread Andrew Gabriel

Constantine wrote:

ZFS doesn't do this.


I thought so too. ;)

Situation brief: I've got OpenSolaris 2009.06 installed on a RAID-5 array on a 
controller with a 512 MB cache (as far as I can remember) without a battery to 
protect the cache.


I hope the controller disabled the cache then.
Probably a good idea to run zpool scrub rpool to find out if it's 
broken. It will probably take some time. zpool status will show the 
progress.



On Friday a lightning bolt hit the power supply station of the colocation 
company, and it turned out that their UPSs were not much more than decoration. After 
the reboot, the filesystem and logs are at their last snapshot version.

  

Would also be useful to see output of: zfs list -t all -r zpool/filesystem



wi...@zeus:~/.zfs/snapshot# zfs list -t all -r rpool
NAME   USED  AVAIL  REFER  MOUNTPOINT
rpool  427G  1.37T  82.5K  /rpool
rpool/ROOT 366G  1.37T19K  legacy
rpool/ROOT/opensolaris20.6M  1.37T  3.21G  /
rpool/ROOT/xvm8.10M  1.37T  8.24G  /
rpool/ROOT/xvm-1   690K  1.37T  8.24G  /
rpool/ROOT/xvm-2  35.1G  1.37T   232G  /
rpool/ROOT/xvm-3   851K  1.37T   221G  /
rpool/ROOT/xvm-4   331G  1.37T   221G  /
rpool/ROOT/xv...@install   144M  -  2.82G  -
rpool/ROOT/xv...@xvm  38.3M  -  3.21G  -
rpool/ROOT/xv...@2009-07-27-01:09:1456K  -  8.24G  -
rpool/ROOT/xv...@2009-07-27-01:09:5756K  -  8.24G  -
rpool/ROOT/xv...@2009-09-13-23:34:54  2.30M  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:35:17  1.14M  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:12  5.72M  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:42:45  5.69M  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:25   573K  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:46:34   525K  -   206G  -
rpool/ROOT/xv...@2009-09-13-23:48:11  6.51M  -   206G  -
rpool/ROOT/xv...@2010-04-22-03:50:25  24.6M  -   221G  -
rpool/ROOT/xv...@2010-04-22-03:51:28  24.6M  -   221G  -
  


Actually, there's 24.6Mbytes worth of changes to the filesystem since 
the last snapshot, which is coincidentally about the same as there was 
over the preceding minute between the last two snapshots. I can't tell 
if (or how much of) that happened before, versus after, the reboot though.



rpool/dump16.0G  1.37T  16.0G  -
rpool/export  28.6G  1.37T21K  /export
rpool/export/home 28.6G  1.37T21K  /export/home
rpool/export/home/wiron   28.6G  1.37T  28.6G  /export/home/wiron
rpool/swap16.0G  1.38T   101M  -
=
  


Normally in a power-out scenario, you will only lose asynchronous writes 
since the last transaction group commit, which will be up to 30 seconds 
worth (although normally much less), and you lose no synchronous writes.


However, I've no idea what your potentially flaky RAID array will have 
done. If it was using its cache and thinking it was non-volatile, then 
it could easily have corrupted the zfs filesystem due to having got 
writes out of sequence with transaction commits, and this can render the 
filesystem no longer mountable because the back-end storage has lied to 
zfs about committing writes. Even though you were lucky and it still 
mounts, it might still be corrupted, hence the suggestion to run zpool 
scrub (and even more important, get the RAID array fixed). Since I 
presume ZFS doesn't have redundant storage for this zpool, any corrupted 
data can't be repaired by ZFS, although it will tell you about it. 
Running ZFS without redundancy on flaky storage is not a good place to be.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID Z stripes

2010-08-10 Thread Andrew Gabriel

Phil Harman wrote:

On 10 Aug 2010, at 08:49, Ian Collins i...@ianshome.com wrote:


On 08/10/10 06:21 PM, Terry Hull wrote:
I am wanting to build a server with 16 x 1TB drives, with two 8-drive 
RAID-Z2 arrays striped together. However, I would like the 
capability of adding additional stripes of 2TB drives in the future. 
Will this be a problem? I thought I read it is best to keep the 
stripes the same width and was planning to do that, but I was 
wondering about using drives of different sizes. These drives would 
all be in a single pool.


It would work, but you run the risk of the smaller drives becoming 
full and all new writes going to the bigger vdev. So while usable, 
performance would suffer.


Almost by definition, the 1TB drives are likely to be getting full 
when the new drives are added (presumably because of running out of 
space).


Performance can only be said to suffer relative to a new pool built 
entirely with drives of the same size. Even if he added 8x 2TB drives 
in a RAIDZ3 config it is hard to predict what the performance gap will 
be (on the one hand: RAIDZ3 vs RAIDZ2, on the other: an empty group vs 
an almost full, presumably fragmented, group).


One option would be to add 2TB drives as 5 drive raidz3 vdevs. That 
way your vdevs would be approximately the same size and you would 
have the optimum redundancy for the 2TB drives.


I think you meant 6, but I don't see a good reason for matching the 
group sizes. I'm for RAIDZ3, but I don't see much logic in mixing 
groups of 6+2 x 1TB and 3+3 x 2TB in the same pool (in one group I 
appear to care most about maximising space, in the other I'm 
maximising availability)


Another option - use the new 2TB drives to swap out the existing 1TB drives.
If you can find another use for the swapped out drives, this works well, 
and avoids ending up with sprawling lower capacity drives as your pool 
grows in size. This is what I do at home. The freed-up drives get used 
in other systems and for off-site backups. Over the last 4 years, I've 
upgraded from 1/4TB, to 1/2TB, and now on 1TB drives.
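
The mechanics of the swap-out are roughly (device names made up, one disk at a time):

zpool set autoexpand=on tank
zpool replace tank c1t2d0 c2t2d0    # wait for the resilver before doing the next disk

Once every device in the vdev has been replaced, the extra capacity becomes available (with autoexpand=on, or after a zpool online -e).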



--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Global Spare for 2 pools

2010-08-10 Thread Andrew Gabriel

Tony MacDoodle wrote:
I have 2 ZFS pools all using the same drive type and size. The 
question is can I have 1 global hot spare for both of those pools?


Yes. A hot spare disk can be added to more than one pool at the same time.
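
For example, with a made-up device name:

zpool add pool1 spare c3t7d0
zpool add pool2 spare c3t7d0

The same device then shows up under 'spares' in zpool status for both pools.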

--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS SCRUB

2010-08-09 Thread Andrew Gabriel

Mohammed Sadiq wrote:

Hi
 
Is it recommended to do a scrub while the filesystem is mounted? How 
frequently do we have to scrub, and under what circumstances?


You can scrub while the filesystems are mounted - most people do, 
there's no reason to unmount for a scrub. (Scrub is pool level, not 
filesystem level.)


Scrub does noticeably slow the filesystem, so pick a time of low 
application load or a time when performance isn't critical. If it 
overruns into a busy period, you can cancel the scrub. Unfortunately, 
you can't pause and resume - there's an RFE for this, so if you cancel 
one you can't restart it from where it got to - it has to restart from 
the beginning.
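
For reference, the commands involved (pool name is just an example):

zpool scrub tank       # start
zpool status tank      # check progress
zpool scrub -s tank    # stop/cancel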


You should scrub occasionally anyway. That's your check that data you 
haven't accessed in your application isn't rotting on the disks.


You should also do a scrub before you do a planned reduction of the pool 
redundancy (e.g. if you're going to detach a mirror side in order to 
attach a larger disk), most particularly if you are reducing the 
redundancy to nothing.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Maximum zfs send/receive throughput

2010-08-06 Thread Andrew Gabriel

Jim Barker wrote:

Just an update, I had a ticket open with Sun regarding this and it looks like 
they have a CR for what I was seeing (6975124).
  


That would seem to describe a zfs receive which has stopped for 12 hours.
You described yours as slow, which is not the term I personally would 
use for one which is stopped.
However, you haven't given anything like enough detail here of your 
situation and what's happening for me to make any worthwhile guesses.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance?

2010-07-23 Thread Andrew Gabriel

Thomas Burgess wrote:



On Fri, Jul 23, 2010 at 3:11 AM, Sigbjorn Lie sigbj...@nixtra.com wrote:


Hi,

I've been searching around on the Internet to find some help with
this, but have been
unsuccessful so far.

I have some performance issues with my file server. I have an
OpenSolaris server with a Pentium D
3GHz CPU, 4GB of memory, and a RAIDZ1 over 4 x Seagate
(ST31500341AS) 1,5TB SATA drives.

If I compile or even just unpack a tar.gz archive with source code
(or any archive with lots of
small files), on my Linux client onto a NFS mounted disk to the
OpenSolaris server, it's extremely
slow compared to unpacking this archive locally on the
server. A 22MB .tar.gz file
containng 7360 files takes 9 minutes and 12seconds to unpack over NFS.

Unpacking the same file locally on the server is just under 2
seconds. Between the server and
client I have a gigabit network, which at the time of testing had
no other significant load. My
NFS mount options are: rw,hard,intr,nfsvers=3,tcp,sec=sys.

Any suggestions to why this is?


Regards,
Sigbjorn


as someone else said, adding an ssd log device can help hugely.  I saw 
about a 500% nfs write increase by doing this.

I've heard of people getting even more.


Another option if you don't care quite so much about data security in 
the event of an unexpected system outage would be to use Robert 
Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes 
with sync=disabled, when the changes work their way into an available 
build. The risk is that if the file server goes down unexpectedly, it 
might come back up having lost some seconds worth of changes which it 
told the client (lied) that it had committed to disk, when it hadn't, 
and this violates the NFS protocol. That might be OK if you are using it 
to hold source that's being built, where you can kick off a build again 
if the server did go down in the middle of it. Wouldn't be a good idea 
for some other applications though (although Linux ran this way for many 
years, seemingly without many complaints). Note that there's no 
increased risk of the zpool going bad - it's just that after the reboot, 
filesystems with sync=disabled will look like they were rewound by some 
seconds (possibly up to 30 seconds).


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance?

2010-07-23 Thread Andrew Gabriel

Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Phil Harman

Milkowski and Neil Perrin's zil synchronicity [PSARC/2010/108] changes
with sync=disabled, when the changes work their way into an available

The fact that people run unsafe systems seemingly without complaint for
years assumes that they know silent data corruption when they
see^H^H^Hhear it ... which, of course, they didn't ... because it is
silent ... or having encountered corrupted data, that they have the
faintest idea where it came from. In my day to day work I still find
many people that have been (apparently) very lucky.



Running with sync disabled, or ZIL disabled, you could call unsafe if you
want to use a generalization and a stereotype.  


Just like people say writeback is unsafe.  If you apply a little more
intelligence, you'll know, it's safe in some conditions, and not in other
conditions.  Like ... If you have a BBU, you can use your writeback safely.
And if you're not sharing stuff across the network, you're guaranteed the
disabled ZIL is safe.  But even when you are sharing stuff across the
network, the disabled ZIL can still be safe under the following conditions:

If you are only doing file sharing (NFS, CIFS) and you are willing to
reboot/remount from all your clients after an ungraceful shutdown of your
server, then it's safe to run with ZIL disabled.
  


No, that's not safe. The client can still lose up to 30 seconds of data, 
which could be, for example, an email message which is received and 
foldered on the server, and is then lost. It's probably /*safe enough*/ 
for most home users, but you should be fully aware of the potential 
implications before embarking on this route.


(As I said before, the zpool itself is not at any additional risk of 
corruption, it's just that you might find the zfs filesystems with 
sync=disabled appear to have been rewound by up to 30 seconds.)



If you're unsure, then adding SSD nonvolatile log device, as people have
said, is the way to go.
  


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?

2010-07-19 Thread Andrew Gabriel

Richard Jahnel wrote:

I've tried ssh with blowfish and scp with arcfour. Both are CPU limited long before the 
10g link is.

I've also tried mbuffer, but I get broken pipe errors part way through the 
transfer.
  


Any idea why? Does the zfs send or zfs receive bomb out part way through?

Might be worth trying it over rsh if security isn't an issue, and then 
you lose the encryption overhead. Trouble is that then you've got almost 
no buffering, which can do bad things to the performance, which is why 
mbuffer would be ideal if it worked for you.
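
For reference, the sort of mbuffer invocation people use (hostname, port and buffer sizes are just examples):

# on the receiving host
mbuffer -I 9090 -s 128k -m 1G | zfs receive tank/backup
# on the sending host
zfs send pool/fs@snap | mbuffer -s 128k -m 1G -O receiver:9090

If that's roughly what you're already doing and the pipe still breaks, it would be worth finding out which end exits first.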



I'm open to ideas for faster ways to either zfs send directly or through a 
compressed file of the zfs send output.

For the moment I:

zfs send | pigz
scp (arcfour) the .gz file to the remote host
gunzip | zfs receive

This takes a very long time for 3 TB of data, and barely makes use the 10g 
connection between the machines due to the CPU limiting on the scp and gunzip 
processes.
  


Also, if you have multiple datasets to send, might be worth seeing if 
sending them in parallel helps.
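
A rough sketch of that, with dataset and host names as placeholders:

# zfs send tank/a@snap | ssh remotehost zfs receive -F backup/a &
# zfs send tank/b@snap | ssh remotehost zfs receive -F backup/b &
# wait

Each stream then gets its own ssh/compression CPU, which can help when a 
single pipeline is CPU-bound rather than network-bound.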


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Recommended RAM for ZFS on various platforms

2010-07-16 Thread Andrew Gabriel

Garrett D'Amore wrote:

Btw, instead of RAIDZ2, I'd recommend simply using stripe of mirrors.
You'll have better performance, and good resilience against errors.  And
you can grow later as you need to by just adding additional drive pairs.

-- Garrett
  


Or in my case, I find my home data growth is slightly less than the rate 
of disk capacity increase, so every 18 months or so, I simply swap out 
the disks for higher capacity ones.



--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-12 Thread Andrew Gabriel

Linder, Doug wrote:

Out of sheer curiosity - and I'm not disagreeing with you, just wondering - how 
does ZFS make money for Oracle when they don't charge for it?  Do you think 
it's such an important feature that it's a big factor in customers picking 
Solaris over other platforms?
  


Yes, it is one of many significant factors in customers choosing Solaris 
over other OS's.
Having chosen Solaris, customers then tend to buy Sun/Oracle systems to 
run it on.


Of course, there are the 7000 series products too, which are heavily 
based on the capabilities of ZFS, amongst other Solaris features.


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost ZIL Device - FIXED

2010-07-08 Thread Andrew Kener
Greetings All,

 

I can't believe I didn't figure this out sooner.  First of all, a big thank
you to everyone who gave me advice and suggestions, especially Richard.  The
problem was with the -d switch.  When importing a pool, if you specify -d and
a path, it ONLY looks there.  So if I run:

 

# zpool import -d /var/zfs-log/ tank

 

It won't look for devices in /dev/dsk

Conversely, running without -d /var/zfs-log/ it won't find the log device.
Here is the command that worked:

 

# zpool import -d /var/zfs-log -d /dev/dsk tank

 

And to make sure that this doesn't happen again (I have learned my lesson
this time) I have ordered two small SSD drives to put in a mirrored config
for the log device.  Thanks again to everyone and now I will get some
worry-free sleep :)
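
For reference, attaching the mirrored log once the SSDs arrive should be a 
one-liner along these lines (the device names here are only placeholders):

# zpool add tank log mirror c3t0d0 c3t1d0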

 

Andrew

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost ZIL Device

2010-07-07 Thread Andrew Kener
According to 'zpool upgrade' my pool versions are 22.  All pools were 
upgraded several months ago, including the one in question.  Here is what I get 
when I try to import:

fileserver ~ # zpool import 9013303135438223804
cannot import 'tank': pool may be in use from other system, it was last 
accessed by fileserver (hostid: 0x406155) on Tue Jul  6 10:46:13 2010
use '-f' to import anyway

fileserver ~ # zpool import -f 9013303135438223804
cannot import 'tank': one or more devices is currently unavailable
Destroy and re-create the pool from
a backup source.

On Jul 6, 2010, at 11:48 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Andrew Kener
 
 the OS hard drive crashed [and log device]
 
 Here's what I know:  In zpool >= 19, if you import this, it will prompt you
 to confirm the loss of the log device, and then it will import.
 
 Here's what I have heard:  The ability to import with a failed log device as
 described above, was created right around zpool 14 or 15, not quite sure
 which.
 
 Here's what I don't know:  If the failed zpool was some version which was
 too low ... and you try to import on an OS which is capable of a much higher
 version of zpool ... Can the newer OS handle it just because the newer OS is
 able to handle a newer version of zpool?  Or maybe the version of the failed
 pool is the one that matters, regardless of what the new OS is capable of
 doing now?
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-06 Thread Andrew Jones
 
 Good. Run 'zpool scrub' to make sure there are no
 other errors.
 
 regards
 victor
 

Yes, scrubbed successfully with no errors. Thanks again for all of your 
generous assistance.

/AJ
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Lost ZIL Device

2010-07-06 Thread Andrew Kener
Hello All,

 

I've recently run into an issue I can't seem to resolve.  I have been
running a zpool populated with two RAID-Z1 VDEVs and a file on the
(separate) OS drive for the ZIL:

 

raidz1-0   ONLINE

c12t0d0  ONLINE

c12t1d0  ONLINE

c12t2d0  ONLINE

c12t3d0  ONLINE

raidz1-2   ONLINE

c12t4d0  ONLINE

c12t5d0  ONLINE

c13t0d0  ONLINE

c13t1d0  ONLINE

logs

/ZIL-Log.img

 

This was running on Nexenta Community Edition v3.  Everything was going
smoothly until today when the OS hard drive crashed and I was not able to
boot from it any longer.  I had migrated this setup from an OpenSolaris
install some months back and I still had the old drive intact.  I put it in
the system, booted it up and tried to import the zpool.  Unfortunately, I
have not been successful.  Previously when migrating from OSOL to Nexenta I
was able to get the new system to recognize and import the ZIL device file.
Since it has been lost in the drive crash I have not been able to duplicate
that success.

 

Here is the output from a 'zpool import' command:

 

pool: tank

id: 9013303135438223804

state: UNAVAIL

status: The pool was last accessed by another system.

action: The pool cannot be imported due to damaged devices or data.

   see: http://www.sun.com/msg/ZFS-8000-EY

config:

 

tank UNAVAIL  missing device

  raidz1-0   ONLINE

c12t0d0  ONLINE

c12t1d0  ONLINE

c12t5d0  ONLINE

c12t3d0  ONLINE

  raidz1-2   ONLINE

c12t4d0  ONLINE

c12t2d0  ONLINE

c13t0d0  ONLINE

c13t1d0  ONLINE

 

I created a new file for the ZIL (using mkfile) and tried to specify it for
inclusion with -d but it doesn't get recognized.  Probably because it was
never part of the original zpool.  I also symlinked the new ZIL file into
/dev/dsk but that didn't make any difference either.  

 

Any suggestions?

 

Andrew Kener

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-04 Thread Andrew Jones
 
 - Original Message -
  Victor,
  
  The zpool import succeeded on the next attempt
 following the crash
  that I reported to you by private e-mail!
  
  For completeness, this is the final status of the
 pool:
  
  
  pool: tank
  state: ONLINE
  scan: resilvered 1.50K in 165h28m with 0 errors on
 Sat Jul 3 08:02:30
 
 Out of curiosity, what sort of drives are you using
 here? Resilvering in 165h28m is close to a week,
 which is rather bad imho.

I think the resilvering statistic is quite misleading, in this case. We're 
using very average 1TB retail Hitachi disks, which perform just fine when the 
pool is healthy.

What happened here is that the zpool-tank process was performing a resilvering 
task in parallel with the processing of a very large inconsistent dataset, 
which took the overwhelming majority of the time to complete.

Why it actually took over a week to process the 2TB volume in an inconsistent 
state is my primary concern with the performance of ZFS, in this case.

 
 Vennlige hilsener / Best regards
 
 roy
 --
 Roy Sigurd Karlsbakk
 (+47) 97542685
 r...@karlsbakk.net
 http://blogg.karlsbakk.net/
 --
 In all pedagogy it is essential that the curriculum be presented
 intelligibly. It is an elementary imperative for all pedagogues to
 avoid excessive use of idioms of foreign origin. In most cases,
 adequate and relevant synonyms exist in Norwegian.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discu
 ss
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-03 Thread Andrew Jones
Victor,

The zpool import succeeded on the next attempt following the crash that I 
reported to you by private e-mail! 

For completeness, this is the final status of the pool:


  pool: tank
 state: ONLINE
 scan: resilvered 1.50K in 165h28m with 0 errors on Sat Jul  3 08:02:30 2010
config:

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  raidz2-0  ONLINE   0 0 0
c0t0d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
c0t2d0  ONLINE   0 0 0
c0t3d0  ONLINE   0 0 0
c0t4d0  ONLINE   0 0 0
c0t5d0  ONLINE   0 0 0
c0t6d0  ONLINE   0 0 0
c0t7d0  ONLINE   0 0 0
cache
  c2t0d0ONLINE   0 0 0

errors: No known data errors

Thank you very much for your help. We did not need to add additional RAM to 
solve this, in the end. Instead, we needed to persist with the import through 
several panics to finally work our way through the large inconsistent dataset; 
it is unclear whether the resilvering caused additional processing delay. 
Unfortunately, the delay made much of the data quite stale, now that it's been 
recovered.

It does seem that zfs would benefit tremendously from a better (quicker and 
more intuitive?) set of recovery tools, that are available to a wider range of 
users. It's really a shame, because the features and functionality in zfs are 
otherwise absolutely second to none.

/Andrew
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-02 Thread Andrew Jones
 Andrew,
 
 Looks like the zpool is telling you the devices are
 still doing work of 
 some kind, or that there are locks still held.
 

Agreed; it appears the CSV1 volume is in a fundamentally inconsistent state 
following the aborted zfs destroy attempt. See later in this thread where 
Victor has identified this to be the case. I am awaiting his analysis of the 
latest crash.

 From the section 2 Intro man page the errors are listed. Number 16
 looks to be an EBUSY.
 
 16 EBUSY   Device busy
            An attempt was made to mount a device that was already
            mounted, or an attempt was made to unmount a device on
            which there is an active file (open file, current
            directory, mounted-on file, active text segment). It
            will also occur if an attempt is made to enable
            accounting when it is already enabled. The device or
            resource is currently unavailable. EBUSY is also used
            by mutexes, semaphores, condition variables, and r/w
            locks, to indicate that a lock is held, and by the
            processor control function P_ONLINE.
 
 Andrew Jones wrote:
  Just re-ran 'zdb -e tank' to confirm the CSV1
 volume is still exhibiting error 16:
 
  snip
  Could not open tank/CSV1, error 16
  snip
 
  Considering my attempt to delete the CSV1 volume
 led to the failure in the first place, I have to
 think that if I can either 1) complete the deletion
 of this volume or 2) roll back to a transaction prior
 to this based on logging or 3) repair whatever
 corruption has been caused by this partial deletion,
 that I will then be able to import the pool.
 
  What does 'error 16' mean in the ZDB output, any
 suggestions?
 
 
 -- 
 Geoff Shipman | Senior Technical Support Engineer
 Phone: +13034644710
 Oracle Global Customer Services
 500 Eldorado Blvd. UBRM-04 | Broomfield, CO 80021
 Email: geoff.ship...@sun.com | Hours:9am-5pm
 MT,Monday-Friday
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discu
 ss

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-01 Thread Andrew Jones
Victor,

I've reproduced the crash and have vmdump.0 and dump device files. How do I 
query the stack on crash for your analysis? What other analysis should I 
provide?
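
For anyone following along, a minimal sketch of pulling the panic stack out 
of such a dump, assuming the compressed vmdump.0 is first expanded with 
savecore and a stock mdb is available:

# savecore -vf vmdump.0
# mdb unix.0 vmcore.0
> ::status
> ::msgbuf
> ::stack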

Thanks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-07-01 Thread Andrew Jones
Victor,

A little more info on the crash, from the messages file is attached here. I 
have also decompressed the dump with savecore to generate unix.0, vmcore.0, and 
vmdump.0.


Jun 30 19:39:10 HL-SAN unix: [ID 836849 kern.notice] 
Jun 30 19:39:10 HL-SAN ^Mpanic[cpu3]/thread=ff0017909c60: 
Jun 30 19:39:10 HL-SAN genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf 
Page fault) rp=ff0017909790 addr=0 occurred in module unknown due to a 
NULL pointer dereference
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 839527 kern.notice] sched: 
Jun 30 19:39:10 HL-SAN unix: [ID 753105 kern.notice] #pf Page fault
Jun 30 19:39:10 HL-SAN unix: [ID 532287 kern.notice] Bad kernel fault at 
addr=0x0
Jun 30 19:39:10 HL-SAN unix: [ID 243837 kern.notice] pid=0, pc=0x0, 
sp=0xff0017909880, eflags=0x10002
Jun 30 19:39:10 HL-SAN unix: [ID 211416 kern.notice] cr0: 
8005003bpg,wp,ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de
Jun 30 19:39:10 HL-SAN unix: [ID 624947 kern.notice] cr2: 0
Jun 30 19:39:10 HL-SAN unix: [ID 625075 kern.notice] cr3: 336a71000
Jun 30 19:39:10 HL-SAN unix: [ID 625715 kern.notice] cr8: c
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rdi:  282 
rsi:15809 rdx: ff03edb1e538
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rcx:5  
r8:0  r9: ff03eb2d6a00
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]rax:  202 
rbx:0 rbp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r10: f80d16d0 
r11:4 r12:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]r13: ff03e21bca40 
r14: ff03e1a0d7e8 r15: ff03e21bcb58
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]fsb:0 
gsb: ff03e25fa580  ds:   4b
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] es:   4b  
fs:0  gs:  1c3
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice]trp:e 
err:   10 rip:0
Jun 30 19:39:10 HL-SAN unix: [ID 592667 kern.notice] cs:   30 
rfl:10002 rsp: ff0017909880
Jun 30 19:39:10 HL-SAN unix: [ID 266532 kern.notice] ss:   38
Jun 30 19:39:10 HL-SAN unix: [ID 10 kern.notice] 
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909670 
unix:die+dd ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909780 
unix:trap+177b ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909790 
unix:cmntrap+e6 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 802836 kern.notice] ff0017909880 0 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098a0 
unix:debug_enter+38 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179098c0 
unix:abort_sequence_enter+35 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909910 
kbtrans:kbtrans_streams_key+102 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909940 
conskbd:conskbdlrput+e7 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099b0 
unix:putnext+21e ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff00179099f0 
kbtrans:kbtrans_queueevent+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a20 
kbtrans:kbtrans_queuepress+7c ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a60 
kbtrans:kbtrans_untrans_keypressed_raw+46 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909a90 
kbtrans:kbtrans_processkey+32 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909ae0 
kbtrans:kbtrans_streams_key+175 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b10 
kb8042:kb8042_process_key+40 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b50 
kb8042:kb8042_received_byte+109 ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909b80 
kb8042:kb8042_intr+6a ()
Jun 30 19:39:10 HL-SAN genunix: [ID 655072 kern.notice] ff0017909bb0 
i8042:i8042_intr+c5 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c00 
unix:av_dispatch_autovect+7c ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0017909c40 
unix:dispatch_hardint+33 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183552f0 
unix:switch_sp_and_call+13 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355340 
unix:do_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355350 
unix:_interrupt+b8 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff00183554a0 
unix:htable_steal+198 ()
Jun 30 19:39:11 HL-SAN genunix: [ID 655072 kern.notice] ff0018355510 
unix:htable_alloc+248 ()
Jun 30 19:39:11 

Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-29 Thread Andrew Jones
 
 On Jun 29, 2010, at 8:30 PM, Andrew Jones wrote:
 
  Victor,
  
  The 'zpool import -f -F tank' failed at some point
 last night. The box was completely hung this morning;
 no core dump, no ability to SSH into the box to
 diagnose the problem. I had no choice but to reset,
 as I had no diagnostic ability. I don't know if there
 would be anything in the logs?
 
 It sounds like it might run out of memory. Is it an
 option for you to add more memory to the box
 temporarily?

I'll place the order for more memory or transfer some from another machine. 
Seems quite likely that we did run out of memory.

 
 Even if it is an option, it is good to prepare for
 such outcome and have kmdb loaded either at boot time
 by adding -k to 'kernel$' line in GRUB menu, or by
 loading it from console with 'mdb -K' before
 attempting import (type ':c' at mdb prompt to
 continue). In case it hangs again, you can press
 'F1-A' on the keyboard, drop into kmdb and then use
 '$systemdump' to force a crashdump.

I'll prepare the machine this way and repeat the import to reproduce the hang, 
then break into the kernel and capture the core dump.
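
For reference, the boot-time variant is just an extra -k on the kernel$ line 
in menu.lst; on a stock OpenSolaris install the entry looks roughly like this 
(the exact line varies between builds, so treat it as a sketch):

kernel$ /platform/i86pc/kernel/$ISADIR/unix -k -B $ZFS-BOOTFS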

 
 If your hardware has a physical or virtual NMI button,
 you can use that too to drop into kmdb, but you'll
 need to set a kernel variable for that to work:
 
 http://blogs.sun.com/darren/entry/sending_a_break_to_o
 pensolaris
 
  Earlier I ran 'zdb -e -bcsvL tank' in write mode
 for 36 hours and gave up to try something different.
 Now the zpool import has hung the box.
 
 What do you mean by running zdb in write mode? zdb
 normally is readonly tool. Did you change it in some
 way?

I had read elsewhere that set zfs:zfs_recover=1 and set aok=1 placed zdb 
into some kind of a write/recovery mode. I have set these in /etc/system. Is 
this a bad idea in this case?

 
  Should I try zdb again? Any suggestions?
 
 It sounds like zdb is not going to be helpful, as
 inconsistent dataset processing happens only in
 read-write mode. So you need to try above suggestions
 with more memory and kmdb/nmi.

Will do, thanks!

 
 victor
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discu
 ss

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Now at 36 hours since zdb process start and:


 PID USERNAME  SIZE   RSS STATE  PRI NICE  TIME  CPU PROCESS/NLWP
   827 root 4936M 4931M sleep   590   0:50:47 0.2% zdb/209

Idling at 0.2% processor for nearly the past 24 hours... feels very stuck. 
Thoughts on how to determine where and why?
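
One way to see where it is sitting, using the pid from the prstat output 
above (pstack covers the userland side; the mdb pipeline assumes a stock 
set of dcmds):

# pstack 827
# echo "0t827::pid2proc | ::walk thread | ::findstack -v" | mdb -k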
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Update: have given up on the zdb write mode repair effort, at least for now. 
Hoping for any guidance / direction anyone's willing to offer...

Re-running 'zpool import -F -f tank' with some stack trace debug, as suggested 
in similar threads elsewhere. Note that this appears hung at near idle.


ff03e278c520 ff03e9c60038 ff03ef109490   1  60 ff0530db4680
  PC: _resume_from_idle+0xf1CMD: zpool import -F -f tank
  stack pointer for thread ff03e278c520: ff00182bbff0
  [ ff00182bbff0 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dbuf_read+0x1e8()
dnode_next_offset_level+0x129()
dnode_next_offset+0xa2()
get_next_chunk+0xa5()
dmu_free_long_range_impl+0x9e()
dmu_free_object+0xe6()
dsl_dataset_destroy+0x122()
dsl_destroy_inconsistent+0x5f()
findfunc+0x23()
dmu_objset_find_spa+0x38c()
dmu_objset_find_spa+0x153()
dmu_objset_find+0x40()
spa_load_impl+0xb23()
spa_load+0x117()
spa_load_best+0x78()
spa_import+0xee()
zfs_ioc_pool_import+0xc0()
zfsdev_ioctl+0x177()
cdev_ioctl+0x45()
spec_ioctl+0x5a()
fop_ioctl+0x7b()
ioctl+0x18e()
dtrace_systrace_syscall32+0x11a()
_sys_sysenter_post_swapgs+0x149()
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Dedup had been turned on in the past for some of the volumes, but I had turned 
it off altogether before entering production due to performance issues. GZIP 
compression was turned on for the volume I was trying to delete.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Malachi,

Thanks for the reply. There were no snapshots for the CSV1 volume that I 
recall... very few snapshots on any volume in the tank.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Just re-ran 'zdb -e tank' to confirm the CSV1 volume is still exhibiting error 
16:

snip
Could not open tank/CSV1, error 16
snip

Considering my attempt to delete the CSV1 volume led to the failure in the 
first place, I have to think that if I can either 1) complete the deletion of 
this volume or 2) roll back to a transaction prior to this based on logging or 
3) repair whatever corruption has been caused by this partial deletion, that I 
will then be able to import the pool.

What does 'error 16' mean in the ZDB output, any suggestions?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs indefinitely (retry post in parts; too long?)

2010-06-28 Thread Andrew Jones
Thanks Victor. I will give it another 24 hrs or so and will let you know how it 
goes...

You are right, a large 2TB volume (CSV1) was not in the process of being 
deleted, as described above. It is showing error 16 on  'zdb -e'
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable

2010-06-22 Thread Andrew Gabriel

Gordon Ross wrote:

Anyone know why my ZFS filesystem might suddenly start
giving me an error when I try to ls -d the top of it?
i.e.: ls -d /tank/ws/fubar
/tank/ws/fubar: Operation not applicable

zpool status says all is well.  I've tried snv_139 and snv_137
(my latest and previous installs).  It's an amd64 box.
Both OS versions show the same problem.

Do I need to run a scrub?  (will take days...)

Other ideas?
  


It might be interesting to run it under truss, to see which syscall is 
returning that error.
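
Something along these lines would capture it (standard truss flags; the 
output file name is just an example):

# truss -f -o /tmp/ls.truss ls -d /tank/ws/fubar

Then look near the end of /tmp/ls.truss for the call that comes back with 
that error.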


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-12 Thread andrew
 On 6/10/2010 9:04 PM, Rodrigo E. De León Plicet
 wrote:
  On Tue, Jun 8, 2010 at 7:14 PM, Anurag
 Agarwalanu...@kqinfotech.com  wrote:
 
  We at KQInfotech, initially started on an
 independent port of ZFS to linux.
  When we posted our progress about port last year,
 then we came to know about
  the work on LLNL port. Since then we started
 working on to re-base our
  changing on top Brian's changes.
 
  We are working on porting ZPL on that code. Our
 current status is that
  mount/unmount is working. Most of the directory
 operations and read/write is
  also working. There is still lot more development
 work and testing that
  needs to be going in this. But we are committed to
 make this happen so
  please stay tuned.
   
 
  Good times ahead!
 
 I don't mean to be a PITA, but I'm assuming that
 someone lawyerly has had the appropriate discussions
 with the porting team about how linking against the
 GPL'd Linux kernel means your kernel module has to be
 GPL-compatible.  It doesn't matter if you distribute
 it outside the general kernel source tarball, what
 matters is that you're linking against a GPL program,
 and the old GPL v2 doesn't allow for a
 non-GPL-compatibly-licensed module to do that.

This is incorrect. The viral effects of the GPL only take effect at the point 
of distribution. If ZFS is distributed separately from the Linux kernel as a 
module then the person doing the combining is the user. Different if a Linux 
distro wanted to include it on a live CD, for example. GPL is not concerned 
with what code is linked with what.

Cheers

Andrew.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is it possible to disable MPxIO during OpenSolaris installation?

2010-06-02 Thread Andrew Gabriel




James C. McPherson wrote:
On 2/06/10 03:11 PM, Fred Liu wrote:

 Fix some typos.
 #
 In fact, there is no problem for MPxIO name in technology.
 It only matters for storage admins to remember the name.

You are correct.

 I think there is no way to give short aliases to these long
 tedious MPxIO name.

You are correct that we don't have aliases. However, I do not
agree that the naming is tedious. It gives you certainty about
the actual device that you are dealing with, without having
to worry about whether you've cabled it right.


Might want to add a call record to

 CR 6901193 Need a command to list current usage of disks,
partitions, and slices

which includes a request for vanity naming for disks.

(Actually, vanity naming for disks should probably be brought out into
a separate RFE.)

-- 

Andrew Gabriel |
Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle Pre-Sales
Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom

ORACLE Corporation UK Ltd is a
company incorporated in England & Wales | Company Reg. No. 1782505
| Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA


Oracle is committed to developing practices and products that
help protect the environment




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] unsetting the bootfs property possible? imported a FreeBSD pool

2010-05-25 Thread Andrew Gabriel




Reshekel Shedwitz wrote:

  r...@nexenta:~# zpool set bootfs= tank
cannot set property for 'tank': property 'bootfs' not supported on EFI labeled devices

r...@nexenta:~# zpool get bootfs tank
NAME  PROPERTY  VALUE   SOURCE
tank  bootfstanklocal

Could this be related to the way FreeBSD's zfs partitioned my disk? I thought ZFS used EFI by default though (except for boot pools). 
  


Looks like this bit of code to me:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/common/libzfs_pool.c#473

473 			/*
474 			 * bootfs property cannot be set on a disk which has
475 			 * been EFI labeled.
476 			 */
477 			if (pool_uses_efi(nvroot)) {
478 zfs_error_aux(hdl, dgettext(TEXT_DOMAIN,
479 "property '%s' not supported on "
480 "EFI labeled devices"), propname);
481 (void) zfs_error(hdl, EZFS_POOL_NOTSUP, errbuf);
482 zpool_close(zhp);
483 goto error;
484 			}
485 			zpool_close(zhp);
486 			break;


It's not checking if you're clearing the property before bailing out
with the error about setting it.
A few lines above, another test (for a valid bootfs name) does get
bypassed in the case of clearing the property.

Don't know if that alone would fix it.

-- 

Andrew Gabriel |
Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle Pre-Sales
Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom

ORACLE Corporation UK Ltd is a
company incorporated in England & Wales | Company Reg. No. 1782505
| Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA


Oracle is committed to developing practices and products that
help protect the environment




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [ZIL device brainstorm] intel x25-M G2 has ram cache?

2010-05-24 Thread Andrew Gabriel




Erik Trimble wrote:

  
Frankly, I'm really surprised that there's no solution, given that the
*amount* of NVRAM needed for ZIL (or similar usage) is really quite
small. a dozen GB is more than sufficient, and really, most systems do
fine with just a couple of GB (3-4 or so). Producing a small,
DRAM-based device in a 3.5" HD form-factor with built-in battery
shouldn't be hard, and I'm kinda flabberghasted nobody is doing it.
Well, at least in the sub-$1000 category. I mean, it's 2 SODIMMs, a
AAA-NiCad battery, a PCI-E-DDR2 memory controller, a PCI-E to
SATA6Gbps controller, and that's it. 


It's a bit of a wonky design. The DRAM could do something of the order
of 1,000,000 IOPS, and is then throttled back to a tiny fraction of that
by the SATA bottleneck. Disk interfaces like SATA/SAS really weren't
designed for this type of use.

What you probably want is a motherboard which has a small area of main
memory protected by battery, and a ramdisk driver which knows how to
use it.
Then you'd get the 1,000,000 IOPS. No idea if anyone makes such a thing.

You are correct that ZFS gets an enormous benefit from even tiny
amounts of NV ZIL. Trouble is that no other operating systems or
filesystems work this well with such relatively tiny amounts of NV
storage, so such a hardware solution is very ZFS-specific.

-- 

Andrew Gabriel |
Solaris Systems Architect
Email: andrew.gabr...@oracle.com
Mobile: +44 7720 598213
Oracle Pre-Sales
Guillemont Park | Minley Road | Camberley | GU17 9QG | United Kingdom

ORACLE Corporation UK Ltd is a
company incorporated in England & Wales | Company Reg. No. 1782505
| Reg. office: Oracle Parkway, Thames Valley Park, Reading RG6 1RA


Oracle is committed to developing practices and products that
help protect the environment




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs question

2010-05-20 Thread Andrew Gabriel

Mihai wrote:

hello all,

I have the following scenario of using zfs.
- I have a HDD image that has an NTFS partition stored in a zfs 
dataset in a file called images.img
- I have X physical machines that boot from my server via iSCSI from 
such an image
- Every time a machine asks for a boot request from my server a clone 
of the zfs dataset is created and the machine is given the clone to 
boot from


I want to make an optimization to my framework that involves using a 
ramdisk pool to store the initial hdd images and the clones of the 
image being stored on a disk based pool. 
I tried to do this using zfs, but it wouldn't let me do cross pool clones.


If someone has any idea on how to proceed in doing this, please let me 
know. It is not necessary to do this exactly as I proposed, but it has 
to be something in this direction, a ramdisk backed initial image and 
more disk backed clones.


You haven't said what your requirement is - i.e. what are you hoping to 
improve by making this change? I can only guess.


If you are reading blocks from your initial hdd images (golden images) 
frequently enough, and you have enough memory on your system, these 
blocks will end up on the ARC (memory) anyway. If you don't have enough 
RAM for this to help, then you could add more memory, and/or an SSD as an 
L2ARC device (cache device in zpool command line terms).
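
Adding one later is a single command; the device name here is only a 
placeholder:

# zpool add tank cache c4t0d0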


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Does Opensolaris support thin reclamation?

2010-05-05 Thread Andrew Chace
Support for thin reclamation depends on the SCSI WRITE SAME command; see this 
draft of a document from T10: 

http://www.t10.org/ftp/t10/document.05/05-270r0.pdf. 

I spent some time searching the source code for support for WRITE SAME, but I 
wasn't able to find much. I assume that if it was supported, it would be listed 
in this header file:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/scsi/generic/commands.h

Does anyone know for certain whether Opensolaris supports thin reclamation on 
thinly-provisioned LUNs? If not, is anyone interested in or actively working on 
this? 

I'm especially interested in ZFS' support for thin reclamation, but I would be 
interested in hearing about support (or lack of) for UFS and SVM as well.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to clear invisible, partially received snapshots?

2010-04-29 Thread Andrew Daugherity
I currently use zfs send/recv for onsite backups [1], and am configuring
it for replication to an offsite server as well.  I did an initial full
send, and then a series of incrementals to bring the offsite pool up to
date.

During one of these transfers, the offsite server hung, and I had to
power-cycle it.  It came back up just fine, except that the snapshot it
was receiving when it hung appeared to be both present and nonexistent,
depending on which command was run.

'zfs recv' complained that the target snapshot already existed, but it
did not show up in the output of 'zfs list', and 'zfs destroy' said it
did not exist.

I ran a scrub, which did not find any errors; nor did it solve the
problem.  I discovered some useful commands with zdb [2], and found more
info:

zdb -d showed the snapshot, with an unusual name:
Dataset backup/ims/%zfs-auto-snap_daily-2010-04-22-1900 [ZPL], ID 6325,
cr_txg 28137403, 2.62T, 123234 objects

As opposed to a normal snapshot: 
Dataset backup/i...@zfs-auto-snap_daily-2010-04-21-1900 [ZPL], ID 5132,
cr_txg 27472350, 2.61T, 123200 objects

I then attempted 
'zfs destroy backup/ims/%zfs-auto-snap_daily-2010-04-22-1900', but it
still said the dataset did not exist.

Finally I exported the pool, and after importing it, the snapshot was
gone, and I could receive the snapshot normally.

Is there a way to clear a partial snapshot without an export/import
cycle?


Thanks,

Andrew

[1]
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-December/034554.html
[2] http://www.cuddletech.com/blog/pivot/entry.php?id=980

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mac OS X clients with ZFS server

2010-04-25 Thread Andrew Kener
The correct URL is:

http://code.google.com/p/maczfs/

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Rich Teer
Sent: Sunday, April 25, 2010 7:11 PM
To: Alex Blewitt
Cc: ZFS discuss
Subject: Re: [zfs-discuss] Mac OS X clients with ZFS server

On Fri, 23 Apr 2010, Alex Blewitt wrote:

   For your information, the ZFS project lives (well, limps really) on
   at http://code.google.com/p/mac-zfs. You can get ZFS for Snow Leopard
   from there and we're working on moving forwards from the ancient pool
   support to something more recent. I've relatively recently merged in
   the onnv-gate repository (at build 72) which should make things easier
   to track in the future.
 
  That's good to hear!  I thought Apple yanking ZFS support from Mac OS
was
  a really dumb idea.  Do you work for Apple?

 No, the entire effort is community based. Please feel free to join up to
the
 mailing list from the project page if you're interested in ZFS on Mac OSX.

I tried going to that URL, but got a 404 error...  :-(  What's the correct
one,
please?

--
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Can't import pool due to missing log device

2010-04-18 Thread Andrew Kener
Hullo All:

 

I'm having a problem importing a ZFS pool.  When I first built my fileserver
I created two VDEVs and a log device as follows:

 

raidz1-0   ONLINE

 c12t0d0  ONLINE

 c12t1d0  ONLINE

 c12t2d0  ONLINE

 c12t3d0  ONLINE

raidz1-2   ONLINE

 c12t4d0  ONLINE

 c12t5d0  ONLINE

 c13t0d0  ONLINE

 c13t1d0  ONLINE

logs

 /ZIL-Log.img

 

And put them into a pool.  The log was a file I created on my OS drive
for the ZIL (/ZIL-Log.img).  
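
For context, a file-backed log like that is typically set up along these 
lines, with the size purely illustrative:

# mkfile 4g /ZIL-Log.img
# zpool add tank log /ZIL-Log.img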

 

I wanted to rebuild my server using Nexenta (http://www.nexenta.org/), so I
exported the pool and tried to import it under Nexenta.  I made sure to copy
the ZIL-Log.img file to the new root partition of the Nexenta install so it
would be there for the import.  However, upon booting into the Nexenta
install the pool shows as UNAVAIL and when trying to import I get this:

 

status: One or more devices are missing from the system.

action: The pool cannot be imported. Attach the missing

devices and try again.

   see: http://www.sun.com/msg/ZFS-8000-6X

 

I went back to my OpenSolaris install to see if I could import the pool in
its original environment but no such luck.  It still shows as UNAVAIL and I
can't get it to import.

 

At this point I am about to try the instructions shown here
(http://opensolaris.org/jive/thread.jspa?threadID=62831) but before I went
down that road I thought I'd check with the mailing list to see if anyone
has encountered this or something similar before.  Thanks in advance for any
suggestions.

 

Andrew Kener

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

