[zfs-discuss] How to enforce probing of all disks?

2013-03-22 Thread Jim Klimov

Hello all,

  I have a kind of lame question here: how can I force the system (OI)
to probe all the HDD controllers and disks that it can find, and be
certain that it has searched everywhere for disks?

  My remotely supported home-NAS PC was unavailable for a while, and
a friend rebooted it for me from a LiveUSB image with SSH (oi_148a).
I can see my main pool disks, but not the old boot (rpool) drive.
Meaning, that it does not appear in zpool import nor in format
outputs. While it is possible that it has finally kicked the bucket,
and that won't really be unexpected, I'd like to try and confirm.

  For example, it might fail to spin up or come into contact with
the SATA cable initially - but subsequent probing of the same
controller might just find it. Happened before, too - though
via a reboot and full POST... The friend won't be available for a
few days, and there's no other remote management nor inspection
facility for this box, so I'd like to probe from within OI as much
as I can. Should be an educational quest, too ;)

# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
Slot36                         sata/hp      connected    configured   ok
sata0/0::dsk/c5t0d0            disk         connected    configured   ok
sata0/1::dsk/c5t1d0            disk         connected    configured   ok
sata0/2::dsk/c5t2d0            disk         connected    configured   ok
sata0/3::dsk/c5t3d0            disk         connected    configured   ok
sata0/4::dsk/c5t4d0            disk         connected    configured   ok
sata0/5::dsk/c5t5d0            disk         connected    configured   ok
sata1/0                        sata-port    empty        unconfigured ok
sata1/1                        sata-port    empty        unconfigured ok
... (USB reports follow)

# devfsadm -Cv  -- nothing new found

Nothing of interest in dmesg...

# scanpci -v | grep -i ata
 Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller
 JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller
 JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller

# prtconf -v | grep -i ata
name='ata-dma-enabled' type=string items=1
name='atapi-cd-dma-enabled' type=string items=1
value='ADATA USB Flash Drive'
value='ADATA'
value='ADATA'
name='sata' type=int items=1 dev=none
value='SATA AHCI 1.0 Interface'
dev_link=/dev/cfg/sata1/0
dev_link=/dev/cfg/sata1/1
name='ata-options' type=int items=1
value='atapi'
name='sata' type=int items=1 dev=none
value='\_SB_.PCI0.SATA'
value='SATA AHCI 1.0 Interface'
dev_link=/dev/cfg/sata0/0
dev_link=/dev/cfg/sata0/1
dev_link=/dev/cfg/sata0/2
dev_link=/dev/cfg/sata0/3
dev_link=/dev/cfg/sata0/4
dev_link=/dev/cfg/sata0/5

value='id1,sd@SATA_ST2000DL003-9VT15YD217ZL'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'


value='id1,sd@SATA_ST2000DL003-9VT15YD1XWWB'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'


value='id1,sd@SATA_ST2000DL003-9VT15YD1VLKC'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'


value='id1,sd@SATA_ST2000DL003-9VT15YD21QZL'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'


value='id1,sd@SATA_ST2000DL003-9VT15YD24GCA'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'


value='id1,sd@SATA_ST2000DL003-9VT15YD24GDG'
name='sata-phy' type=int items=1

value='scsiclass,00.vATA.pST2000DL003-9VT1.rCC32' + 
'scsiclass,00.vATA.pST2000DL003-9VT1' + 'scsiclass,00' + 'scsiclass'



This only sees the six ST2000DL003 drives of the main data pool,
and the LiveUSB flash drive...

So - is it possible to try reinitializing and locating connections to
the disk on a commodity motherboard (i.e. no lsiutil, IPMI and such)
using only OI, without rebooting the box?

The pools are not imported, so if I can detach and reload the sata
drivers I might try that, but I am stumped at how...
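
One avenue I see is cfgadm's SATA hardware-specific commands; a hedged
sketch of what I might try on the empty JMicron ports, assuming the
cfgadm_sata plugin on this box supports these operations:

# cfgadm -x sata_reset_port sata1/0       -- reset and re-probe the link on that port
# cfgadm -c configure sata1/0             -- attach the device if one is now seen
# cfgadm -x sata_port_deactivate sata1/0  -- alternatively, power the port down...
# cfgadm -x sata_port_activate sata1/0    -- ...and back up, then configure again

Whether this can actually bring back a drive that failed to spin up,
without a reboot, is exactly what I'd like to learn.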

Re: [zfs-discuss] SSD for L2arc

2013-03-21 Thread Jim Klimov

On 2013-03-21 16:24, Ram Chander wrote:

Hi,

Could you tell me how to configure an SSD to be used for L2ARC? Basically I
want to improve read performance.


The zpool man page is quite informative on theory and concepts ;)

If your pool already exists, you can prepare the SSD (partition/slice
it) and:
# zpool add POOLNAME cache cXtYdZsS

Likewise, to add a ZIL device you can add a log device, either as
a single disk (slice) or as a mirror of two or more:
# zpool add POOLNAME log cXtYdZsS
# zpool add POOLNAME log mirror cXtYdZsS1 cXtYdZsS2



To increase write performance, will an SSD for the ZIL help? As I read on
forums, the ZIL is only used for mysql/transaction-based writes. I have
regular writes only.


It may increase performance in a few ways:

If you have any apps (including NFS, maybe VMs, iSCSI, etc. - not only
databases) that regularly issue synchronous writes - those which must
be stored on media (not just cached and queued) before the call returns
success - then the ZIL catches these writes instead of the main pool
devices. The ZIL is written as a ring buffer, so its size is proportional
to your pool's throughput - about 3 full-size TXG syncs should fit into
the designated ZIL space. That's usually the maximum write bandwidth
times 15 sec (3 * 5s TXG intervals), or a bit more for peace of mind;
for example, a pool fed at roughly 1 GB/s would call for about 15 GB
of SLOG space by this rule of thumb.

1) If the ZIL device (SLOG) is an SSD, it is presumably quick, so
writes should return quickly and sync IOs are less blocked.

2) If the SLOG is on HDD(s) separate from the main pool, then writes
into the ZIL cause no mechanical seeks on the pool's own disks. Without
a separate SLOG, the heads have to travel to the reserved ZIL area
and back during normal pool IO - and this is time stolen from both reads
and writes in the pool. *Possibly*, fragmentation might also be reduced
by having the ZIL outside of the main pool, though I may be technically
wrong on that point.

3) As a *speculation*, it is likely that a HDD doing nothing but SLOG
(i.e. a hotspare with a designated slice for ZIL so it does something
useful while waiting for failover of a larger pool device) would also
give a good boost to performance, since it won't have to seek much.
The rotational latency will be there however, limiting reachable IOPS
in comparison to an SSD SLOG.

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] System started crashing hard after zpool reconfigure and OI upgrade

2013-03-20 Thread Jim Klimov

On 2013-03-20 17:15, Peter Wood wrote:

I'm going to need some help with the crash dumps. I'm not very familiar
with Solaris.

Do I have to enable something to get the crash dumps? Where should I
look for them?


Typically the kernel crash dumps are created as a result of a kernel
panic; they may also be forced by administrative actions like an NMI.
They require you to configure a dump volume of sufficient size (see
dumpadm) and a /var/crash directory, which may be a dataset on a large
enough pool - after the reboot the dump data will be migrated there.
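
For illustration, a minimal setup might look roughly like this (the sizes
and paths are just an example; see dumpadm(1M) for the details):

# zfs create -V 16G rpool/dump            -- a dedicated dump zvol, if none exists yet
# dumpadm -d /dev/zvol/dsk/rpool/dump     -- point crash dumps at that zvol
# dumpadm -s /var/crash                   -- where savecore migrates dumps after reboot
# savecore -L                             -- optional: test by capturing a live dump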

To help with the hangs you can try the BIOS watchdog (which would
require a bmc driver; the one known from OpenSolaris is, alas,
not open-sourced and not redistributable), or a software deadman
timer:

http://www.cuddletech.com/blog/pivot/entry.php?id=1044

http://wiki.illumos.org/display/illumos/System+Hangs

Also, if you configure crash dump on NMI and set up your IPMI card,
then you can likely gain remote access to both the server console
(physical and/or serial) and may be able to trigger the NMI, too.
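
As a rough illustration, the tunables involved (per the links above) go
into /etc/system; I'm citing the names from memory, so please verify them
against those pages before relying on this:

* Enable the software deadman timer:
set snooping=1
set snoop_interval=50000000
* Panic (and hence produce a dump) when an NMI is received:
set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi=1

A reboot is needed for /etc/system changes to take effect.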

HTH,
//Jim



Thanks for the help.


On Wed, Mar 20, 2013 at 8:53 AM, Michael Schuster
michaelspriv...@gmail.com mailto:michaelspriv...@gmail.com wrote:

How about crash dumps?

michael


On Wed, Mar 20, 2013 at 4:50 PM, Peter Wood peterwood...@gmail.com
mailto:peterwood...@gmail.com wrote:

I'm sorry. I should have mentioned that I can't find any
errors in the logs. The last entry in /var/adm/messages is from when
I removed the keyboard after the last reboot, and then it shows
the new boot-up messages when I boot up the system after the
crash. The BIOS log is empty. I'm not sure how to check the IPMI,
but IPMI is not configured and I'm not using it.

Just another observation - the crashes are more intense the more
data the system serves (NFS).

I'm looking into firmware upgrades for the LSI now.


On Wed, Mar 20, 2013 at 8:40 AM, Will Murnane
will.murn...@gmail.com mailto:will.murn...@gmail.com wrote:

Does the Supermicro IPMI show anything when it crashes?
  Does anything show up in event logs in the BIOS, or in
system logs under OI?


On Wed, Mar 20, 2013 at 11:34 AM, Peter Wood
peterwood...@gmail.com mailto:peterwood...@gmail.com wrote:

I have two identical Supermicro boxes with 32GB ram.
Hardware details at the end of the message.

They were running OI 151.a.5 for months. The zpool
configuration was one storage zpool with 3 vdevs of 8
disks in RAIDZ2.

The OI installation is absolutely clean. Just
next-next-next until done. All I do is configure the
network after install. I don't install or enable any
other services.

Then I added more disks and rebuilt the systems with OI
151.a.7, and this time configured the zpool with 6 vdevs
of 5 disks in RAIDZ.

The systems started crashing really badly. They
just disappear from the network: black and unresponsive
console, no error lights but no activity indication
either. The only way out is to power cycle the system.

There is no pattern to the crashes. It may crash in 2
days, it may crash in 2 hours.

I upgraded the memory on both systems to 128GB, to no
avail. This is the max memory they can take.

In summary, all I did was upgrade to OI 151.a.7 and
reconfigure the zpool.

Any idea what could be the problem?

Thank you

-- Peter

Supermicro X9DRH-iF
Xeon E5-2620 @ 2.0 GHz 6-Core
LSI SAS9211-8i HBA
32x 3TB Hitachi HUS723030ALS640, SAS, 7.2K





--
Michael Schuster
http://recursiveramblings.wordpress.com/







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] partioned cache devices

2013-03-19 Thread Jim Klimov

On 2013-03-19 20:38, Cindy Swearingen wrote:

Hi Andrew,

Your original syntax was incorrect.

A p* device is a larger container for the d* device or s* devices.
In the case of a cache device, you need to specify a d* or s* device.
That you can add p* devices to a pool is a bug.


I disagree; at least, I've always thought differently:
the d device is the whole disk denomination, with a
unique number for a particular controller link (c+t).

The disk has some partitioning table, MBR or GPT/EFI.
In these tables, p0 stands for the whole disk (and is
used to manage the partitioning itself), and the rest kind
of depends. In the case of MBR tables, one partition may
be marked as having a Solaris (or Solaris2) type, and
there it holds an SMI table of Solaris slices, and these
slices can hold legacy filesystems or components of ZFS
pools. In the case of GPT, the GPT partitions can be used
directly by ZFS. However, they are also denominated as
slices in ZFS and the format utility.

I believe, Solaris-based OSes accessing a p-named
partition and an s-named slice of the same number
on a GPT disk should lead to the same range of bytes
on disk, but I am not really certain about this.

Also, if a whole disk is given to ZFS (and for OSes
other than the latest Solaris 11 this means non-rpool
disks), then ZFS labels the disk as GPT and defines a
partition for itself plus a small trailing partition
(likely to level out discrepancies with replacement
disks that might happen to be a few sectors too small).
In this case ZFS reports that it uses cXtYdZ as a
pool component, since it considers itself in charge
of the partitioning table and its inner contents, and
doesn't intend to share the disk with other usages
(dual-booting and other OSes' partitions, or SLOG and
L2ARC parts, etc). This also allows ZFS to influence
hardware-related choices, like caching and throttling,
and likely auto-expansion with the changed LUN sizes
by fixing up the partition table along the way, since
it assumes being 100% in charge of the disk.

I don't think there is any crime in trying to use the
partitions (of either kind) as ZFS leaf vdevs; even the
zpool(1M) manpage states that:

... The  following  virtual  devices  are supported:
  disk
A block device, typically located under  /dev/dsk.
ZFS  can  use  individual  slices  or  partitions,
though the recommended mode of operation is to use
whole  disks.  ...

This is orthogonal to the fact that there can only be
one Solaris slice table, inside one partition, on MBR.
AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

On my old home NAS with OpenSolaris I certainly did have
MBR partitions on the rpool intended initially for some
dual-booted OSes, but repurposed as L2ARC and ZIL devices
for the storage pool on other disks, when I played with
that technology. Didn't gain much with a single spindle ;)

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] partioned cache devices

2013-03-19 Thread Jim Klimov

On 2013-03-19 22:07, Andrew Gabriel wrote:

The GPT partitioning spec requires the disk to be FDISK
partitioned with just one single FDISK partition of type EFI,
so that tools which predate GPT partitioning will still see
such a GPT disk as fully assigned to FDISK partitions, and
therefore less likely to be accidentally blown away.


Okay, I guess I got entangled in terminology now ;)
Anyhow, your words are not all news to me, though my write-up
was likely misleading to unprepared readers... sigh... Thanks
for the clarifications and deeper details that I did not know!

So, we can concur that GPT does indeed include the fake MBR
header with one EFI partition which addresses the smaller of
2TB (MBR limit) or disk size, minus a few sectors for the GPT
housekeeping. Inside the EFI partition are defined the GPT,
um, partitions (represented as slices in Solaris). This is
after all a GUID *Partition* Table, and that's how parted
refers to them too ;)

Notably, there are also unportable tricks to fool legacy OSes
and bootloaders into addressing the same byte ranges via both
MBR entries (forged manually and abusing the GPT/EFI spec) and
proper GPT entries, as partitions in the sense of each table.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs] Petabyte pool?

2013-03-16 Thread Jim Klimov

On 2013-03-16 15:20, Bob Friesenhahn wrote:

On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:


Well, off the top of my head:

2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's

That should fit within 1 rack comfortably and provide 1 PB storage..


What does one do for power?  What are the power requirements when the
system is first powered on?  Can drive spin-up be staggered between JBOD
chassis?  Does the server need to be powered up last so that it does not
time out on the zfs import?


I guess you can use managed PDUs like those from APC (many models for
varied socket types and counts); they can be scripted at an advanced
level, and at a basic level I think delays can simply be configured
per socket to stagger the startup after power is restored from the
wall (UPS), regardless of what the boxes' individual power supplies can
do. Conveniently, they also let you do a remote hard-reset of hung
boxes without walking to the server room ;)

My 2c,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs] Petabyte pool?

2013-03-16 Thread Jim Klimov

On 2013-03-16 15:20, Bob Friesenhahn wrote:

On Sat, 16 Mar 2013, Kristoffer Sheather @ CloudCentral wrote:


Well, off the top of my head:

2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPU's
8 x 60-Bay JBOD's with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 x JBOD's

That should fit within 1 rack comfortably and provide 1 PB storage..


What does one do for power?  What are the power requirements when the
system is first powered on?  Can drive spin-up be staggered between JBOD
chassis?  Does the server need to be powered up last so that it does not
time out on the zfs import?



Giving this question a second thought, I think JBODs should spin-up
quickly (i.e. when power is given) while the server head(s) take time
to pass POST, initialize their HBAs and other stuff. Booting 8 JBODs,
one every 15 seconds to complete a typical spin-up power draw, would
take a couple of minutes. It is likely that a server booted along with
the first JBOD won't get to importing the pool this quickly ;)

Anyhow, with such a system attention should be given to redundant power
and cooling, including redundant UPSes preferably fed from different
power lines going into the room.

This does not seem like a fantastic power sucker, however. 480 drives at
15W would consume 7200W; add a bit for the processor/RAM heads (perhaps
a kW?) and this would still fit into 8-10kW, so a couple of 15kVA UPSes
(or several smaller ones) should suffice, including redundancy. This might
overall exceed a rack in size, though. But for power/cooling this seems
like a standard figure for a 42U rack, or just a bit more.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun X4200 Question...

2013-03-14 Thread Jim Klimov

On 2013-03-11 21:50, Bob Friesenhahn wrote:

On Mon, 11 Mar 2013, Tiernan OToole wrote:


I know this might be the wrong place to ask, but hopefully someone can
point me in the right direction...
I got my hands on a Sun x4200. It's the original one, not the M2, and
has 2 single-core Opterons, 4GB RAM and 4 73GB SAS disks...
But I don't know what to install on it... I was thinking of SmartOS,
but the site mentions Intel support for VT, but nothing for
AMD... The Opterons don't have VT, so I won't be using Xen, but the
Zones may be useful...


OpenIndiana or OmniOS seem like the most likely candidates.

You can run VirtualBox on OpenIndiana and it should be able to work
without VT extensions.


Also note that without the extensions VirtualBox has some quirks -
most notably, lack of acceleration and of support for virtual SMP.
But unlike some other virtualizers, it should work (it does work for
us on a Thumper, also with pre-VT-x Opteron CPUs). However, recently
the VMs' virtual hardware clocks became way too slow. I am at a loss so
far; the forum was moderately helpful - probably the load on the
host and the induced latencies play their role. But the problem does
happen on more modern hardware too, so the lack of VT-x shouldn't be
the reason...

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun X4200 Question...

2013-03-14 Thread Jim Klimov

On 2013-03-15 01:58, Gary Driggs wrote:

On Mar 14, 2013, at 5:55 PM, Jim Klimov jimkli...@cos.ru wrote:


However, recently the VM virtual hardware clocks became way slow.


Does NTP help correct the guest's clock?


Unfortunately no - neither guest NTP, ntpdate or rdate in crontabs,
nor the VirtualBox timesync settings, alone or even combined for a test
(though they are known to conflict), have definitely helped so far.

We also have some setups on rather lightly loaded hardware where after
a few days of uptime the clock stalls to the point that it has a
groundhog day - rotating over the same 2-3 second range for hours,
until the VM is powered off and booted again.

Conversely, we also have dozens of VMs (and a few hosts) where no
such problems occur. Weird stuff...

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SVM ZFS

2013-02-26 Thread Jim Klimov

On 2013-02-26 21:30, Morris Hooten wrote:

Besides copying data from /dev/md/dsk/x volume manager filesystems to
new zfs filesystems
does anyone know of any zfs conversion tools to make the
conversion/migration from svm to zfs
easier?


Do you mean something like a tool that would change the metadata around
your userdata in-place and turn an SVM volume into a ZFS pool, like
Windows' built-in FAT-to-NTFS conversion? No, there's nothing like it.

However, depending on your old system's configuration, you might have
to be careful about the choice of copy programs. Namely, if your setup
used some ACLs (beyond standard POSIX access bits), then you'd need
ACL-aware copying tools. Sun tar and cpio are among them (see the man
pages for usage examples); rsync 3.0.10 was recently reported to support
Solaris ACLs as well, but I didn't test that myself. GNU tar and cpio
are known to do a poor job with intimate Solaris features, though they
might be superior for some other tasks. Basic (Sun, not GNU) cp and mv
should work correctly too.

I most often use rsync -avPHK /src/ /dst/, especially if there are
no ACLs to think about, or the target's inheritable ACLs are acceptable
(and overriding them with original's access rights might even be wrong).

Also, before you do the migration, think ahead of the storage and IO
requirements for the datasets. For example, log files are often huge,
compress into orders of magnitude less, and the IOPS loss might be
negligible (or even a boost, due to smaller hardware IOs and fewer seeks).
Randomly accessed (written) data might not like heavier compressions.
Databases or VM images might benefit from smaller maximum block sizes,
although often these are not made 1:1 with DB block size, but rather
balance about 4 DB entries in an FS block of 32Kb or 64Kb (from what
I saw suggested on the list).

Singly-written data, like OS images, might benefit from compression as
well. If you have local zones, you might benefit from carrying over
(or installing from scratch) one typical example zone, DUMMY, into a
dedicated dataset, then cloning it into as many actual zone roots as you
need, and rsync -cavPHK --delete-after from the originals into those
clones - this way only differing files (or parts thereof) would be
transferred, giving you the benefits of cloning (space saving) without
the downsides of deduplication.
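
A rough sketch of that approach, with hypothetical names:

# zfs create rpool/zones/DUMMY        -- template dataset; populate it with a typical zone
# zfs snapshot rpool/zones/DUMMY@golden
# zfs clone rpool/zones/DUMMY@golden rpool/zones/zone1
# rsync -cavPHK --delete-after /oldzones/zone1/ /rpool/zones/zone1/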

Also, for data in the zones (such as database files, tomcat/glassfish
application server roots, etc.) you might like to use separate dataset
hierarchies mounted via delegation of a root ZFS dataset into zones.
This way your zoneroots would live a separate life from application
data and non-packaged applications, which might simplify backups, etc.
and you might be able to store these pieces in different pools (i.e.
SSDs for some data and HDDs for other - though most list members would
rightfully argue in favor of L2ARC on the SSDs).

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SVM ZFS

2013-02-26 Thread Jim Klimov

Ah, I forgot to mention - ufsdump|ufsrestore was at one time also
a recommended way of making such a transition ;)

I think it should be aware of all the intimacies of the FS, including
sparse files, which reportedly may puzzle some other archivers.
Although with any sort of ZFS compression (including the lightweight
zle) zero-filled blocks should translate into zero IOs. (Maybe
some metadata would appear, to address the holes, however.)
With proper handling of sparse files you don't write any of that
voidness into the FS and you don't process anything on reads.
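
For reference, the classic pipeline looks roughly like this (the source
metadevice and the target mountpoint are hypothetical):

# ufsdump 0f - /dev/md/rdsk/d10 | ( cd /tank/newfs && ufsrestore rf - )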

Have fun,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Distro Advice

2013-02-26 Thread Jim Klimov

On 2013-02-27 05:36, Ian Collins wrote:

Bob Friesenhahn wrote:

On Wed, 27 Feb 2013, Ian Collins wrote:

I am finding that rsync with the right options (to directly
block-overwrite) plus zfs snapshots is providing me with pretty
amazing deduplication for backups without even enabling
deduplication in zfs.  Now backup storage goes a very long way.

We do the same for all of our legacy operating system backups. Take a
snapshot then do an rsync - an excellent way of maintaining
incremental backups for those.

Magic rsync options used:

-a --inplace --no-whole-file --delete-excluded

This causes rsync to overwrite the file blocks in place rather than
writing to a new temporary file first.  As a result, zfs COW produces
primitive deduplication of at least the unchanged blocks (by writing
nothing) while writing new COW blocks for the changed blocks.


Do these options impact performance or reduce the incremental stream sizes?

I just use -a --delete and the snapshots don't take up much space
(compared with the incremental stream sizes).




Well, to be certain, you can create a dataset with a large file in it,
snapshot it, rsync over a changed variant of the file, snapshot again and
compare the referenced sizes. If the file was rewritten into a new temporary
one and then renamed over the original, you'd likely end up with as much
used storage as for the original file. If only the changes are written into
it in-place, then you'd use a lot less space (and you'd not see a
.garbledfilename in the directory during the process).
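
A quick sketch of such a test, with hypothetical names:

# zfs create tank/rstest
# cp bigfile /tank/rstest/bigfile
# zfs snapshot tank/rstest@before
# rsync -a --inplace --no-whole-file bigfile.changed /tank/rstest/bigfile
# zfs snapshot tank/rstest@after
# zfs list -t all -o name,used,referenced -r tank/rstest

If the in-place overwrite works as intended, the @before snapshot should
only hold the blocks that actually changed.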

If you use rsync over the network to back up stuff, here's an example of
an SMF wrapper for rsyncd, and a config sample to make a snapshot after
completion of the rsync session:

http://wiki.openindiana.org/oi/rsync+daemon+service+on+OpenIndiana

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?

2013-02-21 Thread Jim Klimov

On 2013-02-20 23:49, Markus Grundmann wrote:

add a pool/filesystem property as an additional security layer for
administrators.

Whenever I modify zfs pools or filesystems it's possible to destroy
[on a bad day :-)] my data. A new property protected=on|off on the pool
and/or filesystem can help protect the administrator against data loss
(e.g. a zpool destroy tank or zfs destroy tank/filesystem command
will be rejected when the protected=on property is set).



Hello all,

  I don't want to really hijack this thread, but this request seems
like a nice complement to one I voiced a few times and recently posted
into the bugtracker lest it be forgotten:

Feature #3568: Add a ZFS dataset attribute to disallow creation of
snapshots, ever: https://www.illumos.org/issues/3568

  It is somewhat of an opposite desire - to not allow creation of
datasets (snapshots) rather than forbid their destruction as requested
here, but to a similar effect: to not let some scripted or thoughtless
manual jobs abuse the storage by wasting space in some datasets in the
form of snapshot creation.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?

2013-02-21 Thread Jim Klimov

On 2013-02-21 16:54, Markus Grundmann wrote:

Is there anyone here on the list who has some tips for me about which
files to modify? :-)

In my current source tree there is now a new property PROTECTED available
both for pool and zfs objects. I have also added two functions to
get and set the property above. The source code tree is very big and
some files have the same name in different locations. grep seems to be
my new friend.


You might also benefit from on-line grepping here:

http://src.illumos.org/source/search?q=zfs_do_holddefs=refs=path=hist=project=freebsd-head

There is a project freebsd-head in the illumos codebase; I have no idea
how relevant it is for the BSD users.

HTH,
//Jim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot destroy, volume is busy

2013-02-21 Thread Jim Klimov

On 2013-02-21 17:02, John D Groenveld wrote:

# zfs list -t vol
NAME           USED  AVAIL  REFER  MOUNTPOINT
rpool/dump    4.00G  99.9G  4.00G  -
rpool/foo128  66.2M   100G    16K  -
rpool/swap    4.00G  99.9G  4.00G  -

# zfs destroy rpool/foo128
cannot destroy 'rpool/foo128': volume is busy


Can anything local be holding it (databases, virtualbox, etc)?
Can there be any clones, held snapshots or an ongoing zfs send?
(Perhaps an aborted send left a hold?)
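
A few hedged checks that might narrow this down (the snapshot name is
hypothetical):

# zfs list -t all -r rpool/foo128       -- any snapshots or clones under it?
# zfs holds rpool/foo128@somesnap       -- holds on any snapshot found above
# fuser /dev/zvol/rdsk/rpool/foo128     -- local processes keeping the device open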

Sometimes I have hit a bug with a filesystem dataset becoming so busy
that I couldn't snapshot it. Unmounting and mounting it back usually
helped. This was back in the days of SXCE snv_117 and Solaris 10u8,
and the bug often popped up in conjunction with LiveUpgrade. I believe
this particular issue has since been solved, but maybe something new
like it has appeared?..

Hopefully some on-list gurus might walk you through use of a debugger
or dtrace to track which calls are being made by zfs destroy and lead
it to conclude that the dataset is busy?.. I really only know to use
truss -f -l progname params which helps most of the time, and would
love to learn the modern equivalents which give more insights into code.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raid1 error resilvering and mount

2013-02-19 Thread Jim Klimov

On 2013-02-19 12:39, Konstantin Kuklin wrote:

I didn't replace the disk; after a reboot the system did not start (ZFS
is installed as the default root system), so I booted another system (from
flash), and resilvering started automatically, showing me warnings with
frozen progress (dead on checking zroot/var/crash).


Well, in this case try again with zpool import options I've described
earlier, and zpool scrub to try to inspect and repair the pool state
you have now. You might want to disconnect the broken disk for now,
since resilvering would try to overwrite it anyway (whole disk, or just
differences if it is found to have a valid label ending at an earlier
TXG number).


Will replacing the dead disk heal var/crash with the 0x0 address?


Probably not, since your pool's only copy has an error in it. 0x0 is a
metadata block (the dataset root or close to it), so an error in it is
usually fatal (it is for most dataset types). Possibly, an import with
rollback can return your pool to a state where another blockpointer-tree
version points to a different (older) block as this dataset's 0x0, and
that one would be valid. But if you've already imported the pool and it
ran for a while, chances are that your older, possibly better intact
TXGs are no longer referenceable (rolled out of the ring buffer forever).

Good luck,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raid1 error resilvering and mount

2013-02-19 Thread Jim Klimov

On 2013-02-19 14:24, Konstantin Kuklin wrote:

zfs set canmount=off zroot/var/crash

I can't do this, because zfs list is empty.



I'd argue that in your case it might be desirable to evacuate the data and
reinstall the OS - just to be certain that the ZFS on-disk structures on
the new installation have no defects.

To evacuate data, a read-only import would suffice:

# zpool import -f -N -R /a -o ro zroot

This should import the pool without mounting its datasets (-N).
Using zfs mount zpool/ROOT/myrootfsname and so on you can mount just
the datasets which hold your valuable data individually (under '/a' in
this example), and rsync it to some other storage.

After you've saved your data, you can try to repair the pool by rolling
back:

# zpool export zroot
# zpool import -F -f -N -R /a zroot

This should try to roll back 10 transaction sets or so, possibly giving
you an intact state of ZFS data structures and a usable pool. Maybe not.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raid1 error resilvering and mount

2013-02-19 Thread Jim Klimov

On 2013-02-19 17:02, Victor Latushkin wrote:

On 2/19/13 6:32 AM, Jim Klimov wrote:

On 2013-02-19 14:24, Konstantin Kuklin wrote:

zfs set canmount=off zroot/var/crash

i can`t do this, because zfs list empty



I'd argue that in your case it might be desirable to evacuate data and
reinstall the OS - just to be certain that ZFS on-disk structures on
new installation have no defects.

To evacuate data, a read-only import would suffice:


This is a good idea but ..



# zpool import -f -N -R /a -o ro zroot


This command will not achieve readonly import.

For readonly import one needs to use 'zpool import -o readonly=on
poolname' as 'zpool import -o ro poolname' will import in R/W mode
and just mount filesystems readonly.


Oops, my bad. Do what the guru says! Really, I was mistaken
in this fast-typing ;)


Feel free to add other options (-f, -N, etc) as needed.




This should import the pool without mounting its datasets (-N).
Using zfs mount zpool/ROOT/myrootfsname and so on you can mount just
the datasets which hold your valuable data individually (under '/a' in
this example), and rsync it to some other storage.

After you've saved your data, you can try to repair the pool by roll
back:

# zpool export zroot
# zpool import -F -f -N -R /a zroot

This should try to roll back 10 transaction sets or so, possibly giving
you an intact state of ZFS data structures and a usable pool. Maybe not.

//Jim




--


++
||
| Климов Евгений, Jim Klimov |
| технический директор   CTO |
| ЗАО ЦОС и ВТ  JSC COSHT |
||
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru |
|CC:ad...@cos.ru,jimkli...@gmail.com |
++
| ()  ascii ribbon campaign - against html mail  |
| /\- against microsoft attachments  |
++



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raid1 error resilvering and mount

2013-02-17 Thread Jim Klimov

On 2013-02-17 15:46, Konstantin Kuklin wrote:

Hi, I have a ZFS raid1 (mirror) pool with 2 devices;
the first device died and booting from the second is not working...


You didn't say which OS version created the pool (ultimately -
which pool version is there), and I'm not sure about support for
the needed zfs versions in that flash image you linked to. Possibly, the
OI LiveCD might do you a better job - but maybe your disks got too
corrupted in some cataclysm :(

However, generally, recent implementations should have several
useful zpool import flags:
* forcing an import with rollback to an older pool state (-F) -
  which may be or not be more intact (up to 32 or 128 transactions);
* import without automount (-N)
* read-only import (-o ro) which should panic in a lot less cases and
  allows to evacuate readable data by at least cp/rsync
* import without cachefile and/or relocated pool root mountpoint
  (-R /a) so as to, in particular, not damage the namespace of
  your system by this pool (not really relevant in case of livecd's)

Hopefully, you can either import without mounts and issue a zfs
destroy of your offending dataset, or roll back (irreversibly) to
a working state. However, it is also possible that the corruption
is among metadata. If you're lucky and just the latest transaction
got broken during the crash (i.e. disk firmware ignored queuing and
caching hints, and wrote something out of order), then rollback by
one or a few TXGs may point you to an older root of metadata tree
which is not yet overwritten by newer transactions (note: this is
not guaranteed by the OS, just probable) and does contain consistent
metadata in at least one copy of each of the metadata blocks.

Breakage in /var/crash remotely suggests that your system tried to
either create a dump (kernel panic) or, more likely, process one (via
savecore in the case of Solaris), and failed mid-write during this
procedure.



I tried to get the http://mfsbsd.vx.sk/ flash image and boot from it, with
zpool import:
http://puu.sh/2402E

When I load zfs.ko and opensolaris.ko I see this message:
Solaris: WARNING: Can't open objset for zroot/var/crash
Solaris: WARNING: Can't open objset for zroot/var/crash

zpool status:
http://puu.sh/2405f

resilvering freeze with:
zpool status -v
 .
 zroot/usr:0x28ff
 zroot/usr:0x29ff
 zroot/usr:0x2aff
 zroot/var/crash:0x0
  root@Flash:/root #

How can I delete or drop the zroot/var/crash filesystem (1M-10M in size, I
don't remember) and mount the other zfs mountpoints with my data?
--
Best regards,
Konstantin Kuklin.


Good luck,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raid1 error resilvering and mount

2013-02-17 Thread Jim Klimov

Also, adding to my recent post: instead of resilvering, try to run
zpool scrub first - it should verify all checksums and repair
whatever it can via redundancy (for metadata - extra copies).

Resilver is similar to scrub, but it has different goals and
implementation, and might not be so forgiving about pool errors.

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! RPool problem

2013-02-16 Thread Jim Klimov

On 2013-02-16 21:49, John D Groenveld wrote:

By the way, whatever the error message is when booting, it disapears so
quickly I can't read it, so I am only guessing that this is the reason.


Boot with kernel debugger so you can see the panic.

And that would be done like so:
1) In the boot loader (GRUB) edit the boot options (press e,
   select the kernel line, press e again), and add -kd to the
   kernel bootup. Maybe also -v to add verbosity. (A sample
   kernel line is sketched after this list.)

2) Press enter to save the change and b to boot

3) The kmdb prompt should pop up; enter :c to continue execution
   The bootup should start, throw the kernel panic and pause.
   It is likely that there would be so much info that it doesn't
   fit on screen - I can only suggest a serial console in this case.

   However, the end of dump info should point you in the right
   direction. For example, an error in mount_vfs_root is popular,
   and usually means either corrupt media or simply unexpected device
   name for the root pool (i.e. disk plugged on a different port, or
   BIOS changes between SATA-IDE modes, etc.)
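
For illustration, the edited kernel line from step 1 might look roughly
like this on a typical OI menu.lst entry (the exact path and -B options
vary between installs):

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS -v -kd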

The device name changes should go away if you can boot from anything
that can import your rpool (livecd, installer cd, failsafe boot image)
and just zpool import -f rpool; zpool export rpool - this should
clear the dependency on exact device names, and next bootup should
work.

And yes, I think it is a bug for such a fixable problem to behave so
inconveniently - the official docs go as far as to suggest an OS
reinstallation in this case.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL

2013-02-16 Thread Jim Klimov

Hello Cindy,

Are there any plans to preserve the official mailing lists' archives,
or will they go the way of Jive forums and the future digs for bits
of knowledge would rely on alternate mirrors and caches?

I understand that Oracle has some business priorities, but retiring
hardware causes site shutdown? They've gotta be kidding, with all
the buzz about clouds and virtualization ;)

I'd guess, you also are not authorized to say whether Oracle might
permit re-use (re-hosting) of current OpenSolaris.Org materials or
even give away the site and domain for community steering and rid
itself of more black PR by shooting down another public project of
the Sun legacy (hint: if the site does wither and die in community's
hands - it is not Oracle's fault; and if it lives on - Oracle did
something good for karma... win-win, at no price).

Thanks for your helpfulness in the past years,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow zfs writes

2013-02-12 Thread Jim Klimov

On 2013-02-12 10:32, Ian Collins wrote:

Ram Chander wrote:


Hi Roy,
You are right. So it looks like a re-distribution issue. Initially there
were two vdevs with 24 disks (disks 0-23) for close to a year, after
which we added 24 more disks and created additional vdevs. The
initial vdevs are filled up, and so the write speed declined. Now, how
do I find the files that are present on a given vdev or disk? That way I
can remove them and re-copy them back to distribute the data. Any other
way to solve this?


The only way is to avoid the problem in the first place by not mixing
vdev sizes in a pool.





Well, that imbalance is there - in the zpool status printout we see
raidz1 top-level vdevs of 5, 5, 12, 7, 7 and 7 disks and some 5 spares
- which seems to sum up to 48 ;)


Depending on disk size, it might be possible that tlvdev sizes in
gigabytes were kept the same (i.e. a raidz set with twice as many
disks of half size), but we have no info on this detail and it is
unlikely. The disk sets being in one pool, this would still quite
disbalance the load among spindles and IO buses.

Besides all that - with the older tlvdevs being more full than
the newer ones, there is an imbalance which wouldn't be avoided
by not mixing vdev sizes - writes into the newer ones are more likely
to quickly find available holes, while writes into the older ones are
more fragmented, and longer data inspection is needed to find a hole -
if not even gang-block fragmentation. These two are, I believe,
the basis for the performance drop on full pools, with the measure
being rather the mix of IO patterns and the fragmentation of data and
holes.

I think there were developments in illumos ZFS to direct more
writes onto devices with more available space; I am not sure if
the average write latency to a tlvdev was monitored and taken
into account during write-targeting decisions (which would also
cover the case of failing devices which take longer to respond).
I am not sure which portions have been completed and integrated
into common illumos-gate.

As was suggested, you can use zpool iostat -v 5 to monitor IOs
to the pool with a fanout per TLVDEV and per disk, and witness
possible patterns there. Do keep in mind, however, that for a
non-failed raidz set you should see reads from only the data
disks for a particular stripe, while parity disks are not used
unless a checksum mismatch occurs. On the average data should
be on all disks in such a manner that there is no dedicated
parity disk, but with small IOs you are likely to notice this.

If the budget permits, I'd suggest building (or leasing) another
system with balanced disk sets and replicating all data onto it,
then repurposing the older system - for example, to be a backup
of the newer box (also after remaking the disk layout).

As for the question of which files are on the older disks -
you can as a rule of thumb use the file creation/modification
time in comparison with the date when you expanded the pool ;)
Closer inspection could be done with a ZDB walk to print out
the DVA block addresses for blocks of a file (the DVA includes
the number of the top-level vdev), but that would take some
time - to determine which files you want to inspect (likely
some band of sizes) and then to do these zdb walks.
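
A rough sketch of such a walk (the dataset and object number are
hypothetical; a file's object number equals its inode number as
reported by ls -i):

# ls -i /tank/data/somefile     -- note the object (inode) number
# zdb -ddddd tank/data 12345    -- dump its block pointers; each DVA begins with the tlvdev number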

Good luck,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS monitoring

2013-02-11 Thread Jim Klimov

On 2013-02-11 17:14, Borja Marcos wrote:


On Feb 11, 2013, at 4:56 PM, Tim Cook wrote:


The zpool iostat output has all sorts of statistics I think would be 
useful/interesting to record over time.



Yes, thanks :) I think I will add them, I just started with the esoteric ones.

Anyway, still there's no better way to read it than running zpool iostat and 
parsing the output, right?



I believe, in this case you'd have to run it as a continuous process
and parse the outputs after the first one (overall uptime stat, IIRC).
Also note that on problems with ZFS engine itself, zpool may lock up
and thus halt your program - so have it ready to abort an outstanding
statistics read after a timeout and perhaps log an error.

And if pools are imported-exported during work, the zpool iostat
output changes dynamically, so you basically need to parse its text
structure every time.

The zpool iostat -v might be even more interesting though, as it lets
you see per-vdev statistics and perhaps notice imbalances, etc...

All that said, I don't know if this data isn't also available as some
set of kstats - that would probably be a lot better for your cause.
Inspect the zpool source to see where it gets its numbers from...
and perhaps make and RTI relevant kstats, if they aren't yet there ;)
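
For example, on an illumos box some of this is already exposed via kstats
(exact module and stat names may differ per platform):

# kstat -m zfs                  -- ARC and other ZFS-related statistics
# kstat -p -m zfs -n arcstats   -- the same ARC stats in parseable form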

On the other hand, I am not certain how Solaris-based kstats interact
or correspond to structures in FreeBSD (or Linux for that matter)?..

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-10 Thread Jim Klimov

On 2013-02-10 10:57, Datnus wrote:

I run dd if=/dev/zero of=testfile bs=1024k count=5 inside the iSCSI vmfs
from ESXi and then rm testfile.

However, the zpool list doesn't decrease at all. In fact, the used storage
increases when I do the dd.

FreeNas 8.0.4 and ESXi 5.0
Help.
Thanks.


Did you also enable compression (any non-off kind) for the ZVOL
which houses your iSCSI volume?

The procedure with zero-copying does allocate (logically) the blocks
requested in the sparse volume. If this volume is stored on ZFS with
compression (active at the moment when you write these blocks), then
ZFS detects an all-zeroes block and uses no space to store it, only
adding a block pointer entry to reference its emptiness. This way you
get some growth in metadata, but none in userdata for the volume.
If by doing this trick you overwrite the non-empty but logically
deleted blocks in the VM's filesystem housed inside iSCSI in the
ZVOL, then the backend storage should shrink by releasing those
non-empty blocks. Ultimately, if you use snapshots - those released
blocks would be reassigned into the snapshots of the ZVOL; and so in
order to get usable free space on your pool, you'd have to destroy
all those older snapshots (between creation and deletion times of
those no-longer-useful blocks).

If you have reservations about compression for VMs (performance-wise
or otherwise), take a look at the zle compression mode, which only
compresses consecutive runs of zeroes.

Also I'd reiterate - the compression mode takes effect for blocks
written after the mode was set. For example, if you prefer to store
your datasets generally uncompressed for any reason, then you can
enable a compression mode, zero-fill the VM disk's free space as
you did, and re-disable the compression for the volume for any
further writes. Also note that if you zfs send or otherwise copy
the data off the dataset into another (backup one), only the one
compression method last defined for the target dataset would be
applied to the new writes into it - regardless of absence or
presence (and type) of compression on the original dataset.
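
A sketch of that sequence, with a hypothetical dataset name (the middle
step happens inside the VM/initiator):

# zfs set compression=zle tank/iscsivol   -- lightweight compression of zero runs
  ... inside the guest: zero-fill the free space, then delete the file ...
# zfs set compression=off tank/iscsivol   -- optionally revert for further writes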

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how to know available disk space

2013-02-08 Thread Jim Klimov
On 2013-02-08 22:47, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

Maybe this isn't exactly what you need, but maybe:

for fs in `zfs list -H -o name` ; do echo $fs ; zfs get 
reservation,refreservation,usedbyrefreservation $fs ; done


What is the sacramental purpose of such a construct in comparison to:

zfs list -H -o reservation,refreservation,usedbyrefreservation,name \
  -t filesystem {-r pool/interesting/dataset}

Just asking - or suggesting a simpler way to do stuff ;)

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub performance

2013-02-04 Thread Jim Klimov
On 2013-02-04 15:52, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

I noticed that sometimes I had terrible rates with < 10MB/sec. Then
later it rose up to > 70MB/sec.

Are you talking about scrub rates for the complete scrub?  Because if you sit 
there and watch it, from minute to minute, it's normal for it to bounce really 
low for a long time, and then really high for a long time, etc.  The only 
measurement that has any real meaning is time to completion.


To paraphrase, the random IOs on HDDs are slow - these are multiple
reads of small blocks dispersed on the disk, be it small files or
copies of metadata or seeks into the DDT. Fast reads are large
sequentially stored files, i.e. when a scrub hits an ISO image or
a movie on your disk, or a series of smaller files from the same
directory that happened to be created and saved in the same TXG
or so, and their userdata was queued to disk as a large sequential
blob in a coalesced write operation.

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub performance

2013-02-04 Thread Jim Klimov

On 2013-02-04 17:10, Karl Wagner wrote:

OK then, I guess my next question would be what's the best way to
undedupe the data I have?

Would it work for me to zfs send/receive on the same pool (with dedup
off), deleting the old datasets once they have been 'copied'? I think I
remember reading somewhere that the DDT never shrinks, so this would not
work, but it would be the simplest way.

Otherwise, I would be left with creating another pool or destroying and
restoring from a backup, neither of which is ideal.


If you have enough space, then copying with dedup=off should work
(zfs send, rsync, whatever works for you best).
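
A hedged sketch of the send/receive variant, with hypothetical names
(assuming dedup was set on tank/data itself rather than higher up, so
that the new dataset inherits dedup=off; a plain send carries no
properties):

# zfs set dedup=off tank/data         -- stop deduping new writes into the original
# zfs snapshot tank/data@undedup
# zfs send tank/data@undedup | zfs receive tank/data.new
# zfs destroy -r tank/data            -- this is where the DDT shrinks (and may thrash)
# zfs rename tank/data.new tank/data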

I think the DDT should shrink, deleting entries as soon as their reference
count goes to 0; however, this by itself can take quite a while and
cause lots of random IO - in my case this might have been the reason for
system hangs and/or panics due to memory starvation. However, after
a series of reboots (and a couple of weeks of disk-thrashing) I was
able to get rid of some more offending datasets in my tests a couple
of years ago now...

As for smarter undedup - I've asked recently, proposing a method
to do it in a stone-age way; but overall there is no ready solution
so far.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-24 Thread Jim Klimov

On 2013-01-24 11:06, Darren J Moffat wrote:


On 01/24/13 00:04, Matthew Ahrens wrote:

On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
darr...@opensolaris.org mailto:darr...@opensolaris.org wrote:

Preallocated ZVOLs - for swap/dump.


Darren, good to hear about the cool stuff in S11.


Yes, thanks, Darren :)


Just to clarify, is this preallocated ZVOL different than the
preallocated dump which has been there for quite some time (and is in
Illumos)?  Can you use it for other zvols besides swap and dump?


It is the same but we are using it for swap now too.  It isn't available
for general use.


Some background:  the zfs dump device has always been preallocated
(thick provisioned), so that we can reliably dump.  By definition,
something has gone horribly wrong when we are dumping, so this code path
needs to be as small as possible to have any hope of getting a dump.  So
we preallocate the space for dump, and store a simple linked list of
disk segments where it will be stored.  The dump device is not COW,
checksummed, deduped, compressed, etc. by ZFS.


Comparing these two statements, can I say (and be correct) that the
preallocated swap devices would lack COW (as I proposed too) and thus
likely snapshots, but would also lack the checksums? (we might live
without compression, though that was once touted as a bonus for swap
over zfs, and certainly can do without dedup)

Basically, they are seemingly little different from preallocated
disk slices - and for those an admin might have better control over
the dedicated disk locations (i.e. faster tracks in a small-seek
stroke range), except that ZFS datasets are easier to resize...
right or wrong?

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Jim Klimov

On 2013-01-23 09:41, casper@oracle.com wrote:

Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations.  Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.



I believe, by the symptoms, that this is what happens often
in particular to Java processes (app-servers and such) - I do
regularly see these have large VM sizes and much (3x) smaller
RSS sizes. One explanation I've seen is that JVM nominally
depends on a number of shared libraries which are loaded to
fulfill the runtime requirements, but aren't actively used and
thus go out into swap quickly. I chose to trust that statement ;)

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 14:29, Darren J Moffat wrote:

Preallocated ZVOLs - for swap/dump.


Sounds like something I proposed on these lists, too ;)
Does this preallocation only mean filling an otherwise ordinary
ZVOL with zeroes (or some other pattern) - if so, to what effect?

Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 23:03, Sašo Kiselkov wrote:

On 01/22/2013 10:45 PM, Jim Klimov wrote:

On 2013-01-22 14:29, Darren J Moffat wrote:

Preallocated ZVOLs - for swap/dump.


Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?


I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).


Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.
Both have a lifetime span of a single system uptime - like L2ARC,
for example - and will be reused anew afterwards - after a reboot,
a power-surge, or a kernel panic.

So while metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on,
and jump around the disk along with rewrites of blocks, the ZVOL
userdata itself may as well occupy the same positions on the disk,
I think, rewriting older stuff. With mirroring likely in place as
well as checksums, there are other ways than COW to ensure that
the swap (at least some component thereof) contains what it should,
even with intermittent errors of some component devices.

Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?

If the latter, and we still intend to preallocate and guarantee
that the swap has its administratively predefined amount of
gigabytes, compressed blocks can be aligned on those starting
locations as if they were not compressed. In effect this would
just decrease the bandwidth requirements, maybe.

For dump this might be just a bulky compressed write from start
to however much it needs, within the preallocated psize limits...

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

On 2013-01-22 23:32, Nico Williams wrote:

IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.


I know of this stance, and in general you're right. But... ;)

Sometimes there are once-in-a-long-while tasks that might require
enormous virtual memory which you wouldn't normally provision
proper hardware for (RAM, SSD), and/or cases when you have to run
similarly greedy tasks on hardware with limited specs (i.e. a home
PC capped at 8GB RAM). As an example I might point to a ZDB walk
taking about 35-40GB of VM on my box. This is not something I do
every month, but when I do - I need it to complete despite having
five times less RAM than that on the box (and the kernel's equivalent
of that walk fails in scanrate hell because it can't swap, btw).

On the other hand, there are tasks like VirtualBox which require
swap to be configured in amounts equivalent to the VM RAM size, but
don't really swap (most of the time). Setting aside SSDs for this
task might be too expensive if they are never actually used in
practice.

But this point is more of a task for swap device tiering (like
with Linux swap priorities), as I proposed earlier last year...

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Jim Klimov

The discussion suddenly got hot and interesting - albeit it has
diverged quite a bit from the original topic ;)

First of all, as a disclaimer: when I earlier proposed such changes
to datasets for swap (and maybe dump) use, I explicitly proposed that
this be a new dataset type - alongside the zvol, fs and snapshot types
we have today. Granted, this distinction was lost in today's exchange,
but it is still an important one - especially since it means that while
the basic ZFS (or rather zpool) rules are maintained, the dataset rules
might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.

 Well, dump and swap devices are kind of special in that they need

verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.


I get your point, but I would argue that if you are willing to
preallocate storage for these, then putting dump/swap on an iSCSI LUN as
opposed to having it locally is kind of pointless anyway. Since they are
used rarely, having them thin-provisioned is probably better in an
iSCSI environment than wasting valuable network-storage resources on
something you rarely need.


I am not sure what in my post led you to think that I meant iSCSI
or otherwise networked storage to keep swap and dump. Some servers
have local disks, you know - and in networked storage environments
the local disks are only used to keep the OS image, swap and dump ;)


Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?


To guarantee that the space is there... Given the recent mishaps
with dumping (i.e. the dump context is quite stripped-down compared
to general kernel work, so multithreading broke somehow), I guess
that pre-provisioned sequential areas might also reduce some risks...
though likely not - random metadata writes would still have to land
in the pool.


You don't understand, the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to
view this kind of fat-provisioned zvol as a simple contiguous
container block, but it is probably more hassle to implement than it's
worth.


I'd argue that transactional integrity in ZFS primarily protects
metadata, so that there is a tree of always-actual block pointers.
There is this octopus of a block-pointer tree whose leaf nodes
point to data blocks - but only as DVAs and checksums, basically.
Nothing really requires data to be or not be COWed and stored at
a different location than the previous version of the block at
the same logical offset for the data consumers (FS users, zvol
users), except that we want that data to be readable even after
a catastrophic pool close (system crash, poweroff, etc.).

We don't (AFAIK) have such a requirement for swap. If the pool
which contained swap kicked the bucket, we probably have a
larger problem whose solution will likely involve reboot and thus
recycling of all swap data.

And for single-device errors with (contiguous) preallocated
unrelocatable swap, we can protect with mirrors and checksums
(used upon read, within this same uptime that wrote the bits).




Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)


If you are talking about the standard opensolaris-style
boot-environments, then yes, this is taken into account. Your BE lives
under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
respectively (both thin-provisioned, since they are rarely needed).


I meant the attribute for the zfs-auto-snapshot service, i.e.:
rpool/swap  com.sun:auto-snapshot  false  local

As I wrote, I'd argue that for new swap (and maybe dump) datasets
the snapshot action should not even be implemented.




Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?


Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.

...

But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the
same size as before. What changes is how much space they occupy on the
underlying pool.


I won't argue with this, as it is perfectly correct for zvols - and
undefined for the new dataset type I proposed.

Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-21 Thread Jim Klimov

On 2013-01-21 07:06, Stephan Budach wrote:

Are there switch stats on whether it has seen media errors?

Has anybody gotten QLogic's SanSurfer to work with anything newer than
Java 1.4.2? ;) I checked the logs on my switches and they don't seem to
indicate such issues, but I am lacking the real-time monitoring that the
old SanSurfer provides.


I don't know what that is except by your message's context, but can't
you install JDK 1.4.2 on your system or in a VM, and set up a script
or batch file to launch the SanSurfer with the specific JAVA_HOME and
PATH values? ;)

Or is the problem in finding the old Java version?

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Jim Klimov

On 2013-01-20 19:55, Tomas Forsman wrote:

On 19 January, 2013 - Jim Klimov sent me these 2,0K bytes:


Hello all,

   While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact unique - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.


Another RFE would be 'zfs dedup mypool/somefs' and basically go through
and do a one-shot dedup. Would be useful in various scenarios. Possibly
go through the entire pool at once, to make dedups intra-datasets (like
the real thing).


Yes, but that was asked before =)

Actually, the pool's metadata does contain all the needed bits (i.e.
the checksum and size of each block), so a scrub-like procedure could
try to find identical blocks among the unique ones (perhaps filtered
to blocks referenced from a dataset that currently wants dedup), throw
one copy out and add a DDT entry for the other.

On 2013-01-20 17:16, Edward Harvey wrote:
 So ... The way things presently are, ideally you would know in
 advance what stuff you were planning to write that has duplicate
 copies.  You could enable dedup, then write all the stuff that's
 highly duplicated, then turn off dedup and write all the
 non-duplicate stuff.  Obviously, however, this is a fairly
 implausible actual scenario.

Well, I guess I could script a solution that uses ZDB to dump the
blockpointer tree (about 100Gb of text on my system), and some
perl or sort/uniq/grep parsing over this huge text to find blocks
that are the same but not deduped - as well as those single-copy
deduped ones, and toggle the dedup property while rewriting the
block inside its parent file with DD.

This would all be within current ZFS's capabilities and ultimately
reach the goals of deduping pre-existing data as well as dropping
unique blocks from the DDT. It would certainly not be a real-time
solution (likely might take months on my box - just fetching the
BP tree took a couple of days) and would require more resources
than needed otherwise (rewrites of same userdata, storing and
parsing of addresses as text instead of binaries, etc.)
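
A very rough sketch of the first step (a hedged example rather than a
polished tool - the dataset name is a placeholder, and a real job
should also compare block sizes and, ideally, the data itself before
trusting a match):

# zdb -dddddd pool/backups > /var/tmp/bptree.txt
# grep ' L0 ' /var/tmp/bptree.txt | \
    sed -n 's/.*cksum=\([0-9a-f:]*\).*/\1/p' | \
    sort | uniq -d > /var/tmp/cksum-dups.txt

Checksums that appear more than once are the candidates for the
rewrite-with-dedup-enabled pass.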

But I do see how this is doable even today, even by a non-expert ;)
(Not sure I'd ever get around to actually doing it this way, though -
it is not a very clean solution, nor a performant one.)

As a bonus, however, this ZDB dump would also provide an answer
to a frequently-asked question: which files on my system intersect
or are the same - and have some/all blocks in common via dedup?
Knowledge of this might help admins with some policy decisions, be it
a witch-hunt for hoarders of identical files or some pattern-making to
determine which datasets should keep dedup=on...

My few cents,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-20 Thread Jim Klimov
On 2013-01-20 16:56, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Jim Klimov

And regarding the considerable activity - AFAIK there is little way
for ZFS to reliably read and test TXGs newer than X


My understanding is like this:  When you make a snapshot, you're just creating 
a named copy of the present latest TXG.  When you zfs send incremental from one 
snapshot to another, you're creating the delta between two TXG's, that happen 
to have names.  So when you break a mirror and resilver, it's exactly the same 
operation as an incremental zfs send, it needs to calculate the delta between 
the latest (older) TXG on the previously UNAVAIL device, up to the latest TXG 
on the current pool.  Yes this involves examining the meta tree structure, and 
yes the system will be very busy while that takes place.  But the work load is 
very small relative to whatever else you're likely to do with your pool during 
normal operation, because that's the nature of the meta tree structure ... very 
small relative to the rest of your data.


Hmmm... Given that many people use automatic snapshots, those do
provide us many roots for branches of block-pointer tree after a
certain TXG (creation of snapshot and the next live variant of
the dataset).

This might allow resilvering to quickly select only those branches
of the metadata tree that are known or assumed to have changed after
a disk was temporarily lost - and not go over datasets (snapshots)
that are known to have been committed and closed (became read-only)
while that disk was online.

I have no idea if this optimization does take place in ZFS code,
but it seems bound to be there... if not - a worthy RFE, IMHO ;)

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-20 Thread Jim Klimov

Did you try replacing the patch-cables and/or SFPs on the path
between the servers and disks, or at least cleaning them? A speck
of dust (or, God forbid, a smear of body fat from a fingerprint)
caught between two optic cable ends might cause any kind of signal
weirdness from time to time... and lead to malformed packets on that
optical link.

Are there switch stats on whether it has seen media errors?

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Jim Klimov

On 2013-01-20 17:16, Edward Harvey wrote:

But, by talking about it, we're just smoking pipe dreams.  Cuz we all know zfs 
is developmentally challenged now.  But one can dream...


I beg to disagree. Most of my own contribution has so far been about
learning stuff and sharing it with others, planting some new ideas and
(hopefully constructively) doubting others - including the
implementation we have now - and I have yet to see someone pick up my
ideas and turn them into code (or prove why they are rubbish). Still,
overall I can't say that development has stagnated by any reasonable
metric of activity.

Yes, maybe there were more cool new things per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.

As a loud example of present active development - take the LZ4 work
completed by Saso recently. From what I gather, this was a single
man's job, done online in view of fellow list members over a few
months, almost like a reality show; and I guess anyone with enough
concentration, time and devotion could do likewise.

I suspect many of my proposals to the list might also take about half
a man-year each to complete. Unfortunately for the community and for
part of myself, I now have some higher daily priorities, so I likely
won't sit down and code lots of stuff in the next few years (until
that Priority goes to school, or so). Maybe that's why I'm eager to
suggest quests for the brilliant coders here who can complete the job
better and faster than I ever would ;)
So I'm doing the next best thing I can to help the progress :)

And I don't believe this is in vain, or that development has ceased
and my writings are only destined to be stuffed under the carpet.
Be it these RFEs or some others, better and more useful, I believe
they will be coded and published in the common ZFS code. Sometime...

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Jim Klimov

Hello all,

  While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact unique - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.

  Thus these many unique deduped blocks are just a burden when my
system writes into the datasets with dedup enabled, when it walks the
superfluously large DDT, when it has to store this DDT on disk and in
ARC, maybe during the scrubbing... These entries bring lots of headache
(or performance degradation) for zero gain.

  So I thought it would be a nice feature to let ZFS go over the DDT
(I won't care if it requires to offline/export the pool) and evict the
entries with count==1 as well as locate the block-pointer tree entries
on disk and clear the dedup bits, making such blocks into regular unique
ones. This would require rewriting metadata (less DDT, new blockpointer)
but should not touch or reallocate the already-saved userdata (blocks'
contents) on the disk. The new BP without the dedup bit set would have
the same contents of other fields (though its parents would of course
have to be changed more - new DVAs, new checksums...)

  In the end my pool would only track as deduped those blocks which do
already have two or more references - which, given the static nature
of such backup box, should be enough (i.e. new full backups of the same
source data would remain deduped and use no extra space, while unique
data won't waste the resources being accounted as deduped).

What do you think?
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)

The way I get it, resilvering is related to scrubbing but limited
in impact such that it rebuilds a particular top-level vdev (i.e.
one of the component mirrors) with an assigned-bad and new device.

So they both should walk the block-pointer tree from the uberblock
(current BP tree root) until they ultimately read all the BP entries
and validate the userdata with checksums. But while scrub walks and
verifies the whole pool and fixes discrepancies (logging checksum
errors), the resilver verifies a particular TLVdev (and maybe has
a cut-off earliest TXG for disks which fell out of the pool and
later returned into it - with a known latest TXG that is assumed
valid on this disk) and the process expects there to be errors -
it is intent on (partially) rewriting one of the devices in it.
Hmmm... Maybe that's why there are no errors logged? I don't know :)

As for practice, I also have one Thumper that logs errors on a
couple of drives upon every scrub. I think it was related to
connectors, at least replugging the disks helped a lot (counts
went from tens per scrub to 0-3). One of the original 250Gb disks
was replaced with a 3Tb one and a 250Gb partition became part of
the old pool (the remainder became a new test pool over a single
device). Scrubbing the pools yields errors on that new 250Gb
partition, but never on the 2.75Tb single-disk pool... so go figure :)

Overall, intermittent errors might be attributed to non-ECC RAM/CPUs
(not our case), temperature affecting the mechanics and electronics
(conditioned server room - not our case), electric power variations
and noise (other systems in the room on the same and other UPSes
don't complain like this), and cable/connector/HBA degradation
(oxidation, wear, etc. - likely all that remains as our cause).
This example concerns internal disks of the Thumper, so at least we
can rule out problems with further components in the path - external
cables, disk trays, etc...

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 20:08, Bob Friesenhahn wrote:

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user requested
scrubbing.  Unplugging a USB drive which is part of a mirror for a short
while results in considerable activity when it is plugged back in.  It
is as if zfs does not trust the device which was temporarily unplugged
and does a full validation of it.


Now, THAT would be resilvering - and by default it should be a limited
one, with a cutoff at the last TXG known to the disk that went MIA/AWOL.
The disk's copies of the pool label (4 copies, in fact) record the last
TXG it knew safely. So the resilver should only try to validate and
copy over the blocks whose BP entries' birth TXG number is above that.
And since these blocks' components (mirror copies or raidz parity/data
parts) are expected to be missing on this device, mismatches are likely
not reported - I am not sure there's any attempt to even detect them.
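
For the curious, that per-device TXG can be peeked at with zdb
(the device path below is just a placeholder):

# zdb -l /dev/rdsk/c5t1d0s0

Each of the four labels printed includes a 'txg' field - the last
transaction group this device has seen safely committed.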

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors

2013-01-19 Thread Jim Klimov

On 2013-01-19 20:23, Jim Klimov wrote:

On 2013-01-19 20:08, Bob Friesenhahn wrote:

On Sat, 19 Jan 2013, Jim Klimov wrote:


On 2013-01-19 18:17, Bob Friesenhahn wrote:

Resilver may in fact be just verifying that the pool disks are coherent
via metadata.  This might happen if the fiber channel is flapping.


Correction: that (verification) would be scrubbing ;)


I don't think that zfs would call it scrubbing unless the user requested
scrubbing.  Unplugging a USB drive which is part of a mirror for a short
while results in considerable activity when it is plugged back in.  It
is as if zfs does not trust the device which was temporarily unplugged
and does a full validation of it.


Now, THAT would be resilvering - and by default it should be a limited
one, with a cutoff at the last TXG known to the disk that went MIA/AWOL.
The disk's copy of the pool label (4 copies in fact) record the last
TXG it knew safely. So the resilver should only try to validate and
copy over the blocks whose BP entries' birth TXG number is above that.
And since these blocks' components (mirror copies or raidz parity/data
parts) are expected to be missing on this device, mismatches are likely
not reported - I am not sure there's any attempt to even detect them.


And regarding the considerable activity - AFAIK there is little way
for ZFS to reliably read and test TXGs newer than X other than to
walk the whole current tree of block pointers and go deeper into those
that match the filter (TLVDEV number in DVA, and optionally TXG numbers
in birth/physical fields).

So likely the resilver does much of the same activity that a full scrub
would - at least in terms of reading all of the pool's metadata (though
maybe not all copies thereof).

My 2c and my speculation,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Jim Klimov

On 2013-01-19 23:39, Richard Elling wrote:

This is not quite true for raidz. If there is a 4k write to a raidz
comprised of 4k sector disks, then
there will be one data and one parity block. There will not be 4 data +
1 parity with 75%
space wastage. Rather, the space allocation more closely resembles a
variant of mirroring,
like some vendors call RAID-1E


I agree with this exact reply, but as I posted sometime late last year,
reporting on my digging in the bowels of ZFS and my problematic pool,
for a 6-disk raidz2 set I only saw allocations (including two parity
disks) divisible by 3 sectors, even if the amount of the (compressed)
userdata was not so rounded. I.e. I had either miniature files or tails
of files fitting into one sector plus two parities (overall a 3 sector
allocation), or tails ranging 2-4 sectors and occupying 6 with parity
(while 2 or 3 sectors could use just 4 or 5 w/parities, respectively).

I am not sure what these numbers mean - 3 being a case for one userdata
sector plus both parities or for half of 6-disk stripe - both such
explanations fit in my case.

But yes, with current raidz allocation there are many ways to waste
space. And those small percentages (or not so small) do add up.
Rectifying this example, i.e. allocating only as much as is used,
does not seem like an incompatible on-disk format change, and should
be doable within the write-queue logic. Maybe it would cause tradeoffs
in efficiency; however, ZFS does explicitly rotate starting disks
of allocations every few megabytes in order to even out the loads
among spindles (normally parity disks don't have to be accessed -
unless mismatches occur on data disks). Disabling such padding would
only help achieve this goal and save space at the same time...

My 2c,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Jim Klimov

On 2013-01-18 06:35, Thomas Nau wrote:

If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 
4K?  This seems like the most obvious improvement.


4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)


My only pain so far is the time a send/receive takes without really loading the
network at all. VM performance is nothing I worry about at all as it's pretty 
good.
So the key question for me is whether going from 8k to 16k or even
32k would have some benefit for that problem?


I would guess that increasing the block size would on one hand improve
your reads - due to more userdata being stored contiguously as part of
one ZFS block - and thus sending of the backup streams should be more
about reading and sending the data and less about random seeking.

On the other hand, this would likely be paid for with more
read-modify-writes (when larger ZFS blocks are partially updated with
the smaller clusters of the VM's filesystem) while the overall system
is running and used for its primary purpose. However, since the guest
FS is likely to store files of non-minimal size, it is likely that the
whole larger backend block would be updated anyway...

So, I think, this is something an experiment can show you - whether the
gain during backup (and primary-job) reads vs. possible degradation
during the primary-job writes would be worth it.

As for the experiment, I guess you can always make a ZVOL with a
different volblocksize (it has to be set at creation time), DD the
data into it from a snapshot of the production zvol, and attach the
VM or its clone to this newly created copy of its disk image.
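
Something along these lines, perhaps (a hedged sketch - zvol names and
sizes are placeholders; run the dd against a snapshot so the source
stays consistent):

# zfs snapshot tank/vm1@blocktest
# zfs create -V 40G -o volblocksize=16k tank/vm1-16k
# dd if=/dev/zvol/rdsk/tank/vm1@blocktest \
     of=/dev/zvol/rdsk/tank/vm1-16k bs=1M

Then point a cloned VM at tank/vm1-16k and compare both the guest IO
and the send/receive behaviour.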

Good luck, and I hope I got Richard's logic right in that answer ;)
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Jim Klimov

On 2013-01-17 16:04, Bob Friesenhahn wrote:

If almost all of the I/Os are 4K, maybe your ZVOLs should use a
volblocksize of 4K?  This seems like the most obvious improvement.



Matching the volume block size to what the clients are actually using
(due to their filesystem configuration) should improve performance
during normal operations and should reduce the number of blocks which
need to be sent in the backup by reducing write amplification due to
overlap blocks.



Also, while you are at it, it would make sense to verify that the
clients (i.e. the VMs' filesystems) do their IOs 4KB-aligned, i.e.
that their partitions start at a 512b-based sector offset divisible
by 8 inside the virtual HDDs, and that the FS headers also align to
that, so the first cluster is 4KB-aligned.

The classic MSDOS MBR did not guarantee such a partition start, since
it used 63 sectors as the cylinder size and offset factor. Newer OSes
don't have to use the classic layout, as any configuration is
allowable; and GPT is typically well aligned too.
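
One way to eyeball the MBR layout of a guest's virtual disk from the
host (a hedged example - the device path is a placeholder; for a
zvol-backed disk you'd point it at the zvol's rdsk path):

# fdisk -W - /dev/rdsk/c5t0d0p0

The Rsect column in the dump is the partition's starting sector:
values divisible by 8 (such as 2048) mean 4KB alignment, while the
classic 63 does not.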

Overall, a single IO in the VM guest changing a 4KB cluster in its
FS should translate to one 4KB IO in your backend storage changing
the dataset's userdata (without reading a bigger block and modifying
it with COW), plus some avalanche of metadata updates (likely with
the COW) for ZFS's own bookkeeping.

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Jim Klimov

On 2013-01-18 00:42, Bob Friesenhahn wrote:

You can install Brendan Gregg's DTraceToolkit and use it to find out who
and what is doing all the writing.  1.2GB in an hour is quite a lot of
writing.  If this is going continuously, then it may be causing more
fragmentation in conjunction with your snapshots.


As a moderately wild guess, since you're speaking of galleries: are
these problematic filesystems read often? By default ZFS updates the
last access time of files it reads, as do many other filesystems, and
this causes avalanches of metadata updates - sync writes (likely) as
well as fragmentation. It may also account for poorly traceable but
considerable space usage in frequent snapshots. You can verify (and
unset) this behaviour with the ZFS dataset property 'atime', i.e.:

# zfs get atime pond/export/home
NAME              PROPERTY  VALUE  SOURCE
pond/export/home  atime     off    inherited from pond
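
If it turns out to be on for the datasets in question, switching it
off is a one-liner and takes effect immediately:

# zfs set atime=off pond/export/home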

On the other hand, check where your software keeps its temporary
files (i.e. during uploads, as may be the case with galleries). Again,
if this is a frequently snapshotted dataset (though 1 hour is not
really that frequent), then needless temp files can be held by those
older snapshots. Moving such temporary work to a different dataset
with a different snapshot schedule, and/or to a different pool (to
keep the related fragmentation contained), may prove useful.

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] help zfs pool with duplicated and missing entry of hdd

2013-01-10 Thread Jim Klimov

On 2013-01-10 08:51, Jason wrote:

Hi,

One of my server's zfs faulted and it shows following:
        NAME        STATE     READ WRITE CKSUM
        backup      UNAVAIL      0     0     0  insufficient replicas
          raidz2-0  UNAVAIL      0     0     0  insufficient replicas
            c4t0d0  ONLINE       0     0     0
            c4t0d1  ONLINE       0     0     0
            c4t0d0  FAULTED      0     0     0  corrupted data
            c4t0d3  FAULTED      0     0     0  too many errors
            c4t0d4  FAULTED      0     0     0  too many errors
...(omit the rest).

My question is why c4t0d0 appeared twice, and c4t0d2 is missing.

Have checked the controller card and hard disks, they are all working fine.


This renaming does seem like an error in detecting (and subsequently
naming) the disks - i.e. if a connector got loose and one of the disks
is not seen by the system, the numbering can shift in such a manner.
It is indeed strange, however, that only d2 got shifted or went
missing, and not all the numbers after it.

So, did you verify that the controller sees all the disks in the
format command (and perhaps, after a cold reboot, in the BIOS)? Just
in case, try to unplug and replug all the cables (power, data) in case
their pins got oxidized over time.

HTH,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool performance when nearly full

2012-12-20 Thread Jim Klimov
ctime   Fri Jun  8 00:22:17 2012
crtime  Fri Jun  8 00:22:17 2012
gen 1349746
mode100755
size649720
parent  25
links   1
pflags  4080104
Indirect blocks:
   0 L1  DVA[0]=0:940298000:400 DVA[1]=0:263234a00:400
[L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double
size=4000L/400P birth=1349746L/1349746P fill=5
cksum=682d4fda0b:3cc1aa306094:13ebb22837cf14:4c5c67e522dbca8

   0  L0 DVA[0]=0:95f337000:2 [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=2L/2P
birth=1349746L/1349746P fill=1
cksum=23fce6aa160b:5ab11e5fcbc6c2e:5b38f230e01d508d:12cf92941e4b2487

   2  L0 DVA[0]=0:95f357000:2 [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=2L/2P
birth=1349746L/1349746P fill=1
cksum=3f0ac207affd:f8ed413113d6bdd:24e36c7682cfc297:2549c866ab61e464

   4  L0 DVA[0]=0:95f377000:2 [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=2L/2P
birth=1349746L/1349746P fill=1
cksum=3d40bf3329f0:f459bc876303dd7:2230ee348b7b08c5:3a65d1ebbf52c9dc

   6  L0 DVA[0]=0:95f397000:2 [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=2L/2P
birth=1349746L/1349746P fill=1
cksum=19e01b53eb67:956b52d1df6ecd4:38ff9bd1302bf879:e4661798dd1ae8a0

   8  L0 DVA[0]=0:95f3b7000:2 [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=2L/2P
birth=1349746L/1349746P fill=1
cksum=361e6fd03d40:d0903e491fa09e9:7a2e453ed28baa92:28562c53af3c0495

segment [, 000a) size  640K

After several higher layers of the pointers (just L1 in example above),
you have L0 entries which point to actual data blocks with their DVA
fields.

The example file above fits in five 128K blocks at level L0.

The first component of the DVA address is the top-level vdev ID,
followed by offset and allocation size (including raidzN redundancy).
Depending on your pool's history, larger files may however have been
striped over several TLVDEVs, and relocating them (copying them over
and deleting the originals) might or might not help free up a
particular TLVDEV (upon rewrite they will be striped again, though ZFS
may make different decisions for the new write - and prefer the
more-free devices).
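
To decode one by hand (if I read zdb's notation right): in the L1 line
above, DVA[0]=0:940298000:400 reads as top-level vdev number 0, byte
offset 0x940298000 into that vdev's allocatable area (past the front
labels), and an allocated size of 0x400 bytes.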

Also, if the file's blocks are referenced via snapshots, clones,
dedup or hardlinks, they won't actually be released when you delete
a particular copy of the file.

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The format command crashes on 3TB disk but zpool create ok

2012-12-15 Thread Jim Klimov
On 2012-12-14 17:03, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

Suspicion and conjecture only:  I think format uses an fdisk label, which has a
2T limit.



Technically, fdisk is a program, and the labels (partitioning tables)
are MBR and EFI/GPT :)

And fdisk, at least in OpenIndiana, can explicitly label a disk as
EFI, similarly to what ZFS does when a whole disk is given to a pool.

You might also have luck with GNU parted, though I've had older
builds (i.e. in SXCE) crash on 3Tb disks too, including one that's
labeled as EFI and used in a pool on that same SXCE. There were no
such problems with the newer build of parted in OI, so that disk was
in fact labeled for SXCE while the box was booted from the OI LiveCD.
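
For example, labeling a whole disk as EFI with fdisk looks like this
(a hedged example - the device name is a placeholder, and this wipes
any existing partition table on that disk):

# fdisk -E /dev/rdsk/c5t6d0p0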

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-11 Thread Jim Klimov

On 2012-12-02 05:42, Jim Klimov wrote:

My plan is to dig out the needed sectors of the broken block from
each of the 6 disks and try any and all reasonable recombinations
of redundancy and data sectors to try and match the checksum - this
should be my definite answer on whether ZFS (of that oi151.1.3-based
build) does all I think it can to save data or not. Either I put the
last nail into my itching question's coffin, or I'd nail a bug to
yell about ;)


Well, I've come to a number of conclusions, though I did not yet
close this matter for myself. One regards the definition of all
reasonable recombinations - ZFS does not do *everything* possible to
recover corrupt data; in fact it can't, nobody can.

When I took this to an extreme - assuming that the bytes at different
offsets within a sector might fail on different disks that comprise a
block - an attempt to reconstruct and test a single failed sector
byte-by-byte becomes computationally infeasible: for 4 data disks and
1 parity I got about 4^4096 combinations to test. The next Big Bang
will happen sooner than I'd get a yes or no, or so they say (yes, I
did a rough estimate - about 10^100 seconds even if I used all the
computing horsepower on Earth today).

If there are R known-broken rows of data (be it bits, bytes, sectors,
whole columns, or whatever quantum of data we take) on D data disks
and P parity disks (all readable without HW IO errors), where known
brokenness is both a parity mismatch in this row and checksum mismatch
for the whole userdata block, we do not know in advance how many errors
there are in the row (only hope that not more than there are parity
columns) nor where exactly the problem is. Thanks to checksum mismatch
we do know that at least one error is in the data disks' on-disk data.

We might hope to find a correct original data which matches the
checksum by determining for each data disk the possible alternate
byte values (computed from bytes at same offsets on other disks of
data and parity), and checksumming the recombined userdata blocks
with some of the on-disk bytes replaced by these calculated values.

For each row we test 1..P alternate column values, and we must apply
the alteration to all of the rows where known errors exist, in order
to detect some neighboring but not overlapping errors in different
components of the block's allocation. (This was the breakage scenario
that was deemed possible for raidzN with disk heads hovering over
similar locations all the time).

This can yield a very large field of combinations with a small row
height (i.e. matching 1 byte per disk), or too few combinations with
a row height chosen too big (i.e. the whole of one disk's portion of
the userdata - a quarter, in the case of my 4-data-disk set).

For single-break-per-row tests based on hypotheses from P parities,
D data disks and R broken rows, we need to checksum P*(D^R) userdata
recombinations in order to determine that we can't recover the block.

To catch the less probable several errors per row (up to the amount of
parities we have), we need to retry even more combinations afterwards.

My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks
in about 2-3 seconds, so it is reasonable to size the reconstruction
loops - i.e. the quantum of data per step and thus the number of
steps - to fit within some arbitrarily chosen timeout (30 sec? 1 sec?).
With a fixed number of parity and data disks in a particular TLVDEV,
we can determine the reasonable row heights. Also, low-level recovery
requiring a larger number of cycles might be a job for a separate
tool - i.e. on-line recovery during ZFS IO and scrubs might be limited
to a few sectors' worth of hypotheses, and whatever is not fixed by
that can be manually fed to a programmatic number-cruncher and
possibly get recovered overnight...

I now know that it is cheap and fast to determine parity mismatches
for each single-byte column offset in a userdata block (leading to
D*R userdata bytes whose contents we are not certain of), so even
if the quantum of data for reconstructions is a sector, it is
quite reasonable to start with byte-by-byte mismatch detection.

Locations of detected errors can help us determine whether the
errors are colocated in a single row of sectors (so likely one or
more sectors at the same offset on different disks got broken),
or in several sectors (we might be lucky and have single errors
per disk in neighboring sector numbers).

It is, after all, not reasonable to go below 512b or even the
larger HW sector size as the quantum of data for recovery attempts.
But testing *only* whole columns (*if* that is what is done today)
also forfeits some chances of automated recovery - though, certainly,
the recovery attempts should start with the most probable
combinations, such as all errors being confined to a single disk, and
then go down in step size and test for possible errors on several
component disks. We can afford several thousand checksum tests, which
might give a chance to recover more data than might otherwise be
recoverable.

Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-11 Thread Jim Klimov

On 2012-12-11 16:44, Jim Klimov wrote:

For single-break-per-row tests based on hypotheses from P parities,
D data disks and R broken rows, we need to checksum P*(D^R) userdata
recombinations in order to determine that we can't recover the block.


A small maths correction: the formula above assumes that we change
some one item from its on-disk value to a reconstructed hypothesis on
some one data disk (column) in every row, or on up to P disks if we
try to recover from more than one failed item in a row.

Reality is worse :)

Our original info (parity errors plus a checksum mismatch) only
guaranteed that we have at least one error in the userdata. It is
possible that the other (R-1) errors are on the parity disks, so the
recombination should also check all variants with 0..R-1 rows left
unchanged, with their on-disk contents intact.

This gives us something like P*(D + D^2 + ... + D^R) variants to
test, which for D=4 is roughly a one-third increase in recombinations,
still within the range of computationally feasible error-matching.


Heck, just counting from 1 to 2^64 in an i++ loop takes a lot
of CPU time


By my estimate, even that would take until the next Big Bang,
at least on my one computer ;)

Just for fun: a count to 2^32 took 42 seconds, so my computer can do
about 10^8 trivial loops per second - but that's just a data point.
What really matters is that 4^64 == 2^128 == (2^32)^4, which is a lot.
Roughly, a plain count from 1 to 4^64 would therefore take about
42 * (2^32)^3 ~= 3*10^30 seconds, or roughly 10^23 years. If the
astronomers' estimates are correct, this amounts to some 10^13
lifetimes of our universe, or so ;)

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-10 Thread Jim Klimov

On 2012-12-10 07:35, Timothy Coalson wrote:

The corrupted area looks like a series of 0xFC 0x42 bytes about
half a kilobyte long, followed by zero bytes to the end of sector.
Start of this area is not aligned to a multiple of 512 bytes.


Just a guess, but that might be how the sectors were when the drive came
from the manufacturer, rather than filled with zeros (a test pattern
while checking for bad sectors).  As for why some other sectors did show
zeros in your other results, perhaps those sectors got reallocated from
the reserved sectors after whatever caused your problems, which may not
have been written to during the manufacturer test.


Thanks for the idea. I also figured it might be some test pattern
or maybe some sort of secure wipe, and HDD's relocation to spare
sectors might be a reasonable scenario for such an error creeping
into an LBA which previously had valid data - i.e. the disk tried
to salvage as much of a newly corrupted sector as it could...

I dismissed it because several HDDs had the error at the same
offsets, and some of them had identical contents in the corrupted
sectors; however identical the disks might be, this is just too much
of a coincidence for disk-internal hardware relocation to be The
Reason.


A controller going haywire - that is possible, given that this box
was off until it was recently repaired due to broken cooling, and the
controller is the nearest centralized SPOF common to all disks (with
an overheated CPU, non-ECC RAM and the software further along the
road). I am not sure which one of these *couldn't* issue (or be
interpreted as issuing) a number of weird identical writes to
different disks at the same offsets.

Everyone is a suspect :(

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-09 Thread Jim Klimov

more below...

On 2012-12-06 03:06, Jim Klimov wrote:

It also happens that on disks 1,2,3 the first row's sectors (d0, d2, d3)
are botched - ranges from 0x9C0 to 0xFFF (end of 4KB sector) are zeroes.

The neighboring blocks, located a few sectors away from this one, also
have compressed data and have some regular-looking patterns of bytes,
certainly no long stretches of zeroes.

However, the byte-by-byte XOR matching complains about the whole sector.
All bytes, except some 40 single-byte locations here and there, don't
XOR up to produce the expected (known from disk) value.

I did not yet try the second parity algorithm.

At least in this case, it does not seem that I would find an incantation
needed to recover this block - too many zeroes overlapping (at least 3
disks' data proven compromised), where I did hope for some shortcoming
in ZFS recombination exhaustiveness. In this case - it is indeed too
much failure to handle.

Now waiting for scrub to find me more test subjects - broken files ;)

So, these findings from my first tested bad file remain valid.
Now that I have a couple more error locations found again by
scrub (which for the past week progressed just above 50% of
the pool), there are some more results.

So far only one location has random-looking different data in
the sectors of the block on different disks, which I might at
least try to salvage as described in the beginning of this thread.

In two of the three cases, some of the sectors (in the range which
mismatches the parity data) are not merely clearly invalid - e.g.
filled with long stretches of zeroes while the other sectors look
like uniform binary data (results of compression). Moreover, several
of these sectors (4096 bytes long, at the same offsets on different
drives which are data components of the same block) are literally
identical, which points to some error upon write (perhaps some noise
was interpreted by several disks at once as a command to write at
that location).

The corrupted area looks like a series of 0xFC 0x42 bytes about
half a kilobyte long, followed by zero bytes to the end of sector.
Start of this area is not aligned to a multiple of 512 bytes.

These disks being of an identical model and firmware, I am ready to
believe that they might misinterpret the same interference in the
same way. However, I was under the impression that SATA uses CRCs on
commands and data in the protocol - precisely to counter such noise?..



Question: does such a conclusion sound like a plausible explanation
for my data corruptions (on disks which passed dozens of scrubs
successfully before developing these problems nearly all at once in
about ten locations)?

Thanks for attention,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Zpool error in metadata:0x0

2012-12-08 Thread Jim Klimov

I've had this error on my pool for over a year now, since I first
posted and asked about it. The general consensus was that it is only
fixable by recreating the pool, and that if things don't die right
away, the problem may be benign (i.e. located in some first blocks of
the MOS which are in practice written once and not really used nor
relied upon).

In detailed zpool status this error shows as:
  metadata:0x0

By analogy to other errors in unnamed files, this was deemed to
be the MOS dataset, object number 0.

Anyway, now that I am digging deeper into the ZFS bowels (as detailed
in my other current thread), I've made a tool which can fetch the
sectors which pertain to a given DVA and verify the XOR parity.

With ZDB I've extracted what I believe to be the block-pointer tree
for this object. ZDB tries to dump the whole pool when no child
dataset is given (I saw recently on-list that someone else ran into
this ZDB bug as well), so I used a bit of perl magic:

# time zdb -d -bb -e 1601233584937321596 0 | \
  perl -e '$a=0; while (<STDIN>) { chomp; if ( /^Dataset mos/ ) { $a=1; }
  elsif ( /^Dataset / ) { $a=2; exit 0; };
  if ( $a == 1 ) { print "$_\n"; }  }' > mos.txt

This gives me everything ZDB thinks is part of the MOS, up to the
start of the next Dataset dump:

Dataset mos [META], ID 0, cr_txg 4, 50.5G, 76355 objects,
rootbp DVA[0]=0:590df6a4000:3000 DVA[1]=0:8e4c636000:3000
DVA[2]=0:8107426b000:3000 [L0 DMU objset] fletcher4 lzjb LE
contiguous unique triple size=800L/200P birth=326429440L/326429440P
fill=76355 cksum=1042f7ae8a:63ab010a1de:138cbe92583cd:29e4cd03f544fe

Object  lvl   iblk   dblk  dsize  lsize   %full  type
 0316K16K  84.1M  80.2M   46.49  DMU dnode
dnode flags: USED_BYTES
dnode maxblkid: 5132
Indirect blocks:

   0 L2   DVA[0]=0:590df6a1000:3000 DVA[1]=
0:8e4c63:3000 DVA[2]=0:81074268000:3000 [L2 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/e00P
birth=326429440L/326429440P fill=76355
cksum=128bfcb12fe:237fe2ec55891:29135030da5c326:36973942bee30ba3

   0  L1  DVA[0]=0:590df69b000:6000 DVA[1]=
0:8fd76b8000:6000 DVA[2]=0:81074262000:6000 [L1 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/1200P
birth=326429440L/326429440P fill=1155
cksum=18d8d8f3e6c:3ab2b45afba95:57ad6e7efb1cb00:216c4680d8cb9644

   0   L0 DVA[0]=0:590df695000:3000 DVA[1]=
0:8e4c61e000:3000 DVA[2]=0:8107425c000:3000 [L0 DMU dnode]
fletcher4 lzjb LE contiguous unique triple size=4000L/c00P
birth=326429440L/326429440P fill=31
cksum=da94d97873:15b87afcb5388:15ac58fbe7745d6:2e083d8ef9f3c90
...
(for a total of 3572 block pointers)

I fed this list into my new verification tool, testing all DVA
ditto copies, and it found no blocks with bad sectors - all the
XOR parities and the checksums matched their sector or two worth
of data.

So, given that there are no on-disk errors in the Dataset mos
[META], ID 0 Object #0 - what does the zpool scrub find time
after time and call an error in metadata:0x0?

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Userdata physical allocation in rows vs. columns WAS Digging in the bowels of ZFS

2012-12-08 Thread Jim Klimov

For those who have work to do and can't be bothered to read the
detailed context, please do scroll down to the APPLIED QUESTION marker
about a possible project to implement a better on-disk layout of
blocks. The busy experts' opinions are highly regarded here.
Thanks ;) //Jim

CONTEXT AND SPECULATION

Well, now that I've mostly completed building my tool to locate,
extract from disk and verify the sectors related to any particular
block, I can state with certainty: data sector numbering is columnar,
as depicted in my recent mails (quoted below), not row-based as I had
believed earlier - which would have been more compact to store.

Columns do make certain sense, but do also lead to more wasted space
than could be possible otherwise - and I'm not sure if the allocation
in rows would be really slower to write or read, especially since
the HDD caching would coalesce requests to neighboring sectors -
be they a contiguous quarter of my block's physical data or a series
of every fourth sector from that. This would be more complex to code
and comprehend - likely. Might even require more CPU cycles to account
sizes properly (IF today we just quickly allocate columns of same
size - I skimmed over vdev_raidz.c, but did not look into this detail).

Saving 1-2 sectors from allocations which are some 10-30 sectors long
altogether - this is IMHO a worthy percentage of savings to worry
about, especially given the compression-related paradigm that our CPUs
are slackers with nothing better to do. ZFS overhead on 4K-sectored
disks is pretty expensive already, so I see little need to feed it
extra desserts too ;)

APPLIED QUESTION:

If one were to implement a different sector allocator (rows with a
more precise cutoff vs. columns as they are today) and expose it as a
zfs property that can be set by users (or testing developers), would
it make sense to call it a compression mode (in current terms) and use
a bit from that field? Or would the GRID bits be more properly used
for this?

I am not sure if feature flags are the proper mechanism for this,
except to protect from import and interpretation of such fixed
datasets and pools on incompatible (older) implementations - the
allocation layout is likely going to be an attribute applied to each
block at write-time and noted in the blkptr_t like the checksums and
compression, but only applying to raidzN.

AFAIK, the contents of userdata sectors and their ordering don't even
matter to ZFS layers until decompression - parities and checksums just
apply to prepared bulk data...


//Jim Klimov

On 2012-12-06 02:08, Jim Klimov wrote:

On 2012-12-05 05:52, Jim Klimov wrote:

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes not divisible by 4 (disks) in 4KB sectors, however,
some sectors do apparently get wasted because the A-size in the DVA
is divisible by 6*4KB. With columnar allocation of disks, it is
easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9   13
.  ,  2  6  10  14
.  ,  3  7  11  x
.  ,  4  8  12  x

In this illustration a 14-sector-long block is saved, with X being
the empty leftovers, on which we can't really save (as would be the
case with the other allocation, which is likely less efficient for
CPU and IOs).


Getting more and more puzzled with this... I have seen DVA values
matching both theories now...

Interestingly, all the allocations I looked over involved the number
of sectors divisible by 3... rounding to half of my 6-disk RAID set -
is it merely a coincidence, or some means of balancing IOs?

...

I did not yet research where exactly the unused sectors are
allocated - vertically on the last strip, as in my yesterday's
depiction quoted above, or horizontally across several disks -
but now that I know about this, it really bothers me as wasted
space with no apparent gain. I mean, the raidz code does tricks
to ensure that parities are located on different disks, and in
normal conditions the userdata sector reads land on all disks
in a uniform manner. Why forfeit that natural rotation because of
P-sizes smaller than a multiple of the number of data disks?

...

In short: can someone explain the rationale - why are allocations
such as they are now, and can it be discussed as a bug or should
this be rationalized as a feature?



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove disk

2012-12-06 Thread Jim Klimov

On 2012-12-06 09:35, Albert Shih wrote:


1) add a 5th top-level vdev (eg. another set of 12 disks)


That's not a problem.


That IS a problem if you're going to ultimately remove an enclosure -
once added, you won't be able to remove the extra top-level VDEV from
your ZFS pool.


2) replace the disks with larger ones one-by-one, waiting for a
resilver in between


This is the point where I don't see how to do it. I've got 48 disks,
from /dev/da0 - /dev/da47 (I'm under FreeBSD 9.0), let's say 3To each.

I've got 4 raidz2 vdevs, the first from /dev/da0 - /dev/da11, etc.

So I physically add a new enclosure with 12 new disks, for example 4To disks.

I'm going to have new /dev/da48 -- /dev/da59.

Say I want to remove /dev/da0 - /dev/da11. First I pull out /dev/da0.


I believe FreeBSD should behave similarly to Solaris-based OSes here.
Since your pools are not yet broken, and since you have the luxury of
all disks being present during the migration, it is safer not to pull
out a disk physically and put a new one in its place (physically or
via hot-sparing), but rather to try a software replacement with zpool
replace. This way your pool does not lose redundancy for the duration
of the replacement.


The first raidz2 going to be in «degraded state». So I going to tell the
pool the new disk is /dev/da48.

repeat this_process until /dev/da11 replace by /dev/da59.


Roughly so. Other list members might chime in - but MAYBE it is even
possible or advisable to do software replacement on all 12 disks in
parallel (since the originals are all present)?


But at the end how many space I'm going to use on those /dev/da48 --
/dev/da51. Am I going to have 3To or 4To ? Because each time before
complete ZFS going to use only 3 To how at the end he going to magically
use 4To ?


While the migration is underway and only some of the disks have
completed it, you can only address the old size (3TB); once all the
active disks are big, the pool will expand to use the available
space - automatically if the autoexpand property is on, or after you
run zpool online -e component for each new disk.
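
Roughly, with an invented pool name 'tank' and the device names from
your example - adjust to your real setup:

# zpool set autoexpand=on tank
# zpool replace tank da0 da48
  (wait for the resilver to complete - zpool status tank)
  (repeat for da1/da49 ... da11/da59)
# zpool online -e tank da48
  (the online -e step is only needed if autoexpand was off)

The old disk stays attached and readable until its replacement
finishes, which is exactly why this is safer than pulling it first.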


When I would like to change the disk, I also would like change the disk
enclosure, I don't want to use the old one.


Second question, when I'm going to pull out the first enclosure meaning the
old /dev/da0 -- /dev/da11 and reboot the server the kernel going to give
new number of those disk meaning

old /dev/da12 -- /dev/da0
old /dev/da13 -- /dev/da1
etc...
old /dev/da59 -- /dev/da47

how zfs going to manage that ?



Supposedly, it should manage that well :)
Once your old enclosure's disks are no longer used and you are ready
to remove the enclosure, you should zpool export your pool before
turning off the hardware. This removes the pool from the OS's zfs
cachefile, so upon the next import the pool undergoes a full search
for components. That search is slower than the cachefile when you have
many devices at static locations, but it ensures that all storage
devices are consulted and a new map of the pool components' locations
is drawn. So even if the device numbering changes due to the HW changes
and OS reconfiguration, the full zpool import will take note of this
and import the old data from the new addresses (device names).
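
A sketch of the sequence, again with the invented pool name 'tank':

# zpool export tank
  (power down, remove the old enclosure, reboot)
# zpool import
  (with no arguments this scans the devices and lists importable pools)
# zpool import tank
  (use zpool import -d /some/dir tank if the device nodes live in a
   non-default location)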

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS QoS and priorities

2012-12-05 Thread Jim Klimov

On 2012-12-05 04:11, Richard Elling wrote:

On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru wrote:


I've heard a claim that ZFS relies too much on RAM caching, but
implements no sort of priorities (indeed, I've seen no knobs to
tune those) - so that if the storage box receives many different
types of IO requests with different administrative weights in
the view of admins, it can not really throttle some IOs to boost
others, when such IOs have to hit the pool's spindles.


Caching has nothing to do with QoS in this context. *All* modern
filesystems cache to RAM, otherwise they are unusable.


Yes, I get that. However, many systems get away with less RAM
than recommended for ZFS rigs (like the ZFS SA with a couple
hundred GB as the starting option), and make their compromises
elsewhere. They have to anyway, and they get different results,
perhaps even better suited to certain narrow or big niches.

Whatever the aggregate result, this difference does lead to
some differing features that The Others' marketing trumpets
praise as the advantage :) - like the ability to mark some
IO traffic as higher priority than other traffic, in one
case (which is now also an Oracle product line, apparently)...

Actually, this question stems from a discussion at a seminar
I've recently attended - which praised ZFS but pointed out its
weaknesses against some other players on the market, so we are
not unaware of those.


For example, I might want to have corporate webshop-related
databases and appservers to be the fastest storage citizens,
then some corporate CRM and email, then various lower priority
zones and VMs, and at the bottom of the list - backups.


Please read the papers on the ARC and how it deals with MFU and
MRU cache types. You can adjust these policies using the primarycache
and secondarycache properties at the dataset level.


I've read up on that, and don't exactly see how much these help
if there is pressure on RAM so that cache entries expire...
Meaning, if I want certain datasets to remain cached as long
as possible (i.e. serve a website or DB from RAM, not HDD), at the
expense of other datasets that might see higher usage but have
lower business priority - how do I do that? Or, perhaps, could
(L2)ARC shares, reservations and/or quota concepts be added for
the datasets which I explicitly want to throttle up or down?

At most, now I can mark the lower-priority datasets' data or
even metadata as not cached in ARC or L2ARC. On-off. There seems
to be no smaller steps, like in QoS tags [0-7] or something like
that.
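
For reference, the on-off knobs I mean are per-dataset properties,
roughly like this (dataset names invented):

# zfs set primarycache=metadata tank/backups
# zfs set secondarycache=none tank/backups
# zfs set primarycache=all tank/webshop-db
# zfs get primarycache,secondarycache tank/backups tank/webshop-db

And that is the whole extent of the control - include or exclude,
nothing in between.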

BTW, as a short side question: is it a true or false statement,
that: if I set primarycache=metadata, then ZFS ARC won't cache
any userdata and thus it won't appear in (expire into) L2ARC?
So the real setting is that I can cache data+meta in RAM, and
only meta in SSD? Not the other way around (meta in RAM but
both data+meta in SSD)?



AFAIK, now such requests would hit the ARC, then the disks if
needed - in no particular order. Well, can the order be made
particular with current ZFS architecture, i.e. by setting
some datasets to have a certain NICEness or another priority
mechanism?


ZFS has a priority-based I/O scheduler that works at the DMU level.
However, there is no system call interface in UNIX that transfers
priority or QoS information (eg read() or write()) into the file system VFS
interface. So the granularity of priority control is by zone or dataset.


I do not think I've seen mention of priority controls per dataset,
at least not in generic ZFS. Actually, that was part of my question
above. And while throttling or resource shares between higher level
software components (zones, VMs) might have similar effect, this is
not something really controlled and enforced by the storage layer.


  -- richard


Thanks,
//Jim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS QoS and priorities

2012-12-05 Thread Jim Klimov

On 2012-11-29 10:56, Jim Klimov wrote:

For example, I might want to have corporate webshop-related
databases and appservers to be the fastest storage citizens,
then some corporate CRM and email, then various lower priority
zones and VMs, and at the bottom of the list - backups.


On a side note, I'm now revisiting old ZFS presentations collected
over the years, and one of them listed as a TBD item the idea
that metaslabs with varying speeds could be used for specific
tasks, and not only be the first to receive allocations so that a new
pool would perform quickly. I.e. TBD: workload-specific free-space
selection policies.

Say, I create a new storage box and lay out some bulk file, backup
and database datasets. Even as they are receiving their first bytes,
I have some idea about the kind of performance I'd expect from them -
with QoS per dataset I might destine the databases to the fast LBAs
(and smaller seeks between tracks I expect to use frequently), and
the bulk data onto slower tracks right from the start, and the rest
of unspecified data would grow around the middle of the allocation
range.

These types of data would then only creep onto the less fitting
metaslabs (faster for bulk, slower for DB) if the target ones run
out of free space. Then the next-best-fitting would be used...

This one idea is somewhat reminiscent of hierarchical storage
management, except that it is about static allocation at the
write-time and takes place within the single disk (or set of
similar disks), in order to warrant different performance for
different tasks.

///Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

2012-12-05 Thread Jim Klimov
On 2012-11-17 22:54, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

An easier event to trigger is the starting of the virtualbox guest.  Upon vbox
guest starting, check the service properties for that instance of vboxsvc, and
chmod if necessary.  But vboxsvc runs as non-root user...

I like the idea of using zfs properties, if someday the functionality is going 
to
be built into ZFS, and we can simply scrap the SMF chown service.  But these
days, ZFS isn't seeing a lot of public development.


I just built this into simplesmf, http://code.google.com/p/simplesmf/
Support to execute the zvol chown immediately prior to launching guestvm
I know Jim is also building it into vboxsvc, but I haven't tried that yet.



Lest this point be lost - during discussion of the thread, Edward and
myself ultimately embarked on the voyage to the solutions we saw best,
hacked together during that day or so.

Edward tailored his to VM startup events, while I made a more generic
script which can save POSIX and ACL info from devfs into user attributes
of ZVOLs and extract and apply those values to ZVOLs on demand.

This script can register itself as an SMF service, and apply such values
from zfs to devfs at service startup, and save from devfs to zfs at the
service shutdown. I guess this can be integrated into my main vbox.sh
script to initiate such activities during VM startup, but haven't yet
explored or completed this variant (all the needed pieces should be
there already). Perhaps I need to make such integration before next
official release of vboxsvc.

This is rather a proof-of-concept so far (i.e. the script should be
sure to run after zpool imports/before zpool exports), but brave souls
can feel free to try it out and comment. Presence of the service didn't
cause any noticeable troubles on my test boxen over the past couple of
weeks.

http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/zfs-zvolrights

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VXFS to ZFS

2012-12-05 Thread Jim Klimov

On 2012-12-05 23:11, Morris Hooten wrote:

Is there a documented way or suggestion on how to migrate data from VXFS
to ZFS?


Off the top of my head, I think this would go like any other migration -
create the new pool on new disks and use rsync for simplicity (if your
VxFS setup does not utilize extended attributes or anything similarly
special), or use Solaris tar or cpio if such attributes are used (IIRC
VxFS was a first-class citizen on Solaris, so the native tools - unlike
GNU ones and rsync - should support the intimate details).
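
Roughly along these lines (pool, dataset and path names invented;
check the man pages of your release for the exact extended-attribute
and ACL options):

# zpool create tank mirror c0t2d0 c0t3d0
# zfs create -o compression=lzjb tank/data
# rsync -aH /vxfs/data/ /tank/data/
  (simple, but does not carry extended attributes or NFSv4 ACLs)
# cd /vxfs/data && tar c@f - . | ( cd /tank/data && tar x@pf - )
  (Solaris tar: @ is meant to carry extended attributes, p to restore
   modes/ACLs on extraction - verify against your release)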

Also note that if you have VxFS, then you likely come from a clustered
setup, which VxFS handles natively and safely. ZFS does not support
simultaneous pool imports by several hosts, so you'd have to set up the
clusterware to make sure only one host controls the pool at any time.

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-05 Thread Jim Klimov

On 2012-12-05 05:52, Jim Klimov wrote:

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes not divisible by 4 (disks) in 4KB sectors, however,
some sectors do apparently get wasted because the A-size in the DVA
is divisible by 6*4KB. With columnar allocation of disks, it is
easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9   13
.  ,  2  6  10  14
.  ,  3  7  11  x
.  ,  4  8  12  x

In this illustration a 14-sector-long block is saved, with X being
the empty leftovers, on which we can't really save (as would be the
case with the other allocation, which is likely less efficient for
CPU and IOs).


Getting more and more puzzled with this... I have seen DVA values
matching both theories now...

Interestingly, all the allocations I looked over involved the number
of sectors divisible by 3... rounding to half of my 6-disk RAID set -
is it merely a coincidence, or some means of balancing IOs?

Anyhow, with 4KB sectors involved, I saw many 128KB logical blocks
compressed into just half a dozen sectors of userdata payload, so
wasting one or two sectors here is quite a large percentage of my
storage overhead.

Exposition of found evidence follows:


Say, this one from my original post:
DVA[0]=0:594928b8000:9000 ... size=20000L/4800P

It has 5 data sectors (@4Kb) over 4 data disks in my raidz2 set,
so it spills over to a second row and requires additional parity
sectors - overall 5d+4p = 9 sectors, which we see in DVA A-size.
This is normal, as expected.

These ones however differ:

DVA[0]=0:acef500e000:c000 ... size=20000L/6a00P
DVA[0]=0:acef501a000:c000 ... size=20000L/7200P
DVA[0]=0:acef5026000:c000 ... size=20000L/5c00P

These neighbors, with 7, 8 and 6 sectors worth of data all occupy
12 sectors on disk along with their parities.



DVA[0]=0:59492a92000:6000 ... size=20000L/2800P

With 3*4Kb sectors worth of data and 2 parity sectors, this block
is allocated over 6 not 5 sectors.


DVA[0]=0:5996bf7c000:12000 ... size=20000L/a800P

Likewise, with 11 sectors of data and likely 6 sectors of parity,
this one is given 18, not 17 sectors of storage allocation.



DVA[0]=0:5996be32000:1e000 ... size=20000L/12c00P

Here, 19 sectors of data and 10 of parity occupy 30 sectors on disk.


I did not yet research where exactly the unused sectors are
allocated - vertically on the last strip, like in my yesterday's
depiction quoted above, or horizontally across several disks,
but now that I know about this - it really bothers me as wasted
space with no apparent gain. I mean, the raidz code does tricks
to ensure that parities are located on different disks, and in
normal conditions the userdata sector reads land on all disks
in a uniform manner. Why forfeit the natural rotation thanks
to P-sizes smaller than the multiple of number of data-disks?
Writes are anyway streamed and coalesced, so by not allocating
these unused blocks we'd only reduce the needed write IOPS by
some portion - and save disk space...


In short: can someone explain the rationale - why are allocations
such as they are now, and can it be discussed as a bug or should
this be rationalized as a feature?

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-05 Thread Jim Klimov

more below...

On 2012-12-05 23:16, Timothy Coalson wrote:

On Tue, Dec 4, 2012 at 10:52 PM, Jim Klimov jimkli...@cos.ru wrote:

On 2012-12-03 18:23, Jim Klimov wrote:

On 2012-12-02 05:42, Jim Klimov wrote:


  4) Where are the redundancy algorithms specified? Is there any
simple
  tool that would recombine a given algo-N redundancy sector with
  some other 4 sectors from a 6-sector stripe in order to try and
  recalculate the sixth sector's contents? (Perhaps part of some
  unit tests?)


I'm a bit late to the party, but from a previous list thread about
redundancy algorithms, I had found this:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

Particularly the
functions vdev_raidz_reconstruct_p, vdev_raidz_reconstruct_q, 
vdev_raidz_reconstruct_pq
(and possibly vdev_raidz_reconstruct_general) seem like what you are
looking for.

As I understand it, the case where you have both redundancy blocks, but
are missing two data blocks, is the hardest (if you are missing only one
data block, you can either do straight xor with the first redundancy
section, or some LFSR shifting, xor, and then reverse LFSR shifting to
use the second redundancy section).

Wikipedia describes the math to restore from two missing data sections
here, under computing parity: http://en.wikipedia.org/wiki/Raid6#RAID_6

I don't know any tools to do this for you from arbitrary input, sorry.



Thanks, you are not late and welcome to the party ;)

I'm hacking together a simple program to look over the data sectors and
XOR parity and determine how many, if any, discrepancies there are, and
at what offsets into the sector - byte by byte.

Running it on raw ZFS block component sectors, extracted with DD in the
ways I wrote of earlier in the thread, I did confirm some good sectors
and the one erroneous block that I have.

The latter turns out to have 4.5 sectors' worth of userdata, overall
laid out like this:
      dsk0  dsk1  dsk2  dsk3  dsk4  dsk5
      _     _     _     _     _     p1
      q1    d0    d2    d3    d4*   p2
      q2    d1    _     _     _     _

Here the compressed userdata is contained in the order of my d-sector
numbering, d0-d1-d2-d3-d4, and d4 is only partially occupied (P-size of
the block is 0x4c00) so its final quarter is all zeroes.

It also happens that on disks 1,2,3 the first row's sectors (d0, d2, d3)
are botched - ranges from 0x9C0 to 0xFFF (end of 4KB sector) are zeroes.

The neighboring blocks, located a few sectors away from this one, also
have compressed data and have some regular-looking patterns of bytes,
certainly no long stretches of zeroes.

However, the byte-by-byte XOR matching complains about the whole sector.
All bytes, except some 40 single-byte locations here and there, don't
XOR up to produce the expected (known from disk) value.

I did not yet try the second parity algorithm.

At least in this case, it does not seem that I will find an incantation
to recover this block - too many zeroes overlap (data on at least 3
disks is proven compromised), whereas I had hoped for some shortcoming
in the exhaustiveness of the ZFS recombination attempts. In this case -
it is indeed too much failure to handle.

Now waiting for scrub to find me more test subjects - broken files ;)

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-04 Thread Jim Klimov

On 2012-12-03 18:23, Jim Klimov wrote:

On 2012-12-02 05:42, Jim Klimov wrote:

So... here are some applied questions:


Well, I am ready to reply a few of my own questions now :)


Continuing the desecration of my deceased files' resting grounds...


2) Do I understand correctly that for the offset definition, sectors
in a top-level VDEV (which is all of my pool) are numbered in rows
per-component disk? Like this:
  0  1  2  3  4  5
  6  7  8  9  10 11...

That is, offset % setsize = disknum?

If true, does such numbering scheme apply all over the TLVDEV,
so as for my block on a 6-disk raidz2 disk set - its sectors
start at (roughly rounded) offset_from_DVA / 6 on each disk,
right?

3) Then, if I read the ZFS on-disk spec correctly, the sectors of
the first disk holding anything from this block would contain the
raid-algo1 permutations of the four data sectors, sectors of
the second disk contain the raid-algo2 for those 4 sectors,
and the remaining 4 disks contain the data sectors?


My understanding was correct. For posterity, in the earlier set up
example I had an uncompressed 128KB block residing at the address
DVA[0]=0:590002c1000:30000. Counting in my disks' 4KB sectors,
this is 0x590002c1000/0x1000 = 0x590002C1 or 1493172929 logical
offset into the TLVDEV number 0 (and the only one in this pool).

Given that this TLVDEV is a 6-disk raidz2 set, my expected offset
on each component drive is 1493172929/6 = 248862154.83 (.83=5/6),
starting from after the ZFS header (2 labels and a reservation,
amounting to 4MB = 1024*4KB sectors). So this block's allocation
covers 8 4KB-sectors starting at 248862154+1024 on disk5 and at
248862155+1024 on disks 0,1,2,3,4.

As my further tests showed, the sector-columns (not rows as I had
expected after doc-reading) from disks 1,2,3,4 do recombine into
the original userdata (sha256 checksum matches), so disks 5 and 0
should hold the two parities - however those are calculated:

# for D in 1 2 3 4; do dd bs=4096 count=8 conv=noerror,sync \
  if=/dev/dsk/c7t${D}d0s0 of=b1d${D}.img skip=248863179; done

# for D in 1 2 3 4; do for R in 0 1 2 3 4 5 6 7; do \
  dd if=/pool/test3/b1d${D}.img bs=4096 skip=$R count=1; \
  done; done > /tmp/d

Note that the latter can be greatly simplified as cat, which
also works to the same effect, and is faster:
# cat /pool/test3/b1d?.img > /tmp/d
However I left the difficult notation to use in experiments later on.

That is, the original 128KB block was cut into 4 pieces (my 4 data
drives in the 6-disk raidz2 set), and each 32Kb strip was stored
on a separate drive. Nice descriptive pictures in some presentations
suggested to me that the original block is stored sector by sector
rotating onto the next disk - the set of 4 sectors with 2 parity
sectors in my case being a single stripe for the RAID purposes.
This directly suggested that incomplete such stripes, such as
the ends of files or whole small files, would still have the two
parity sectors and a handful of data sectors.

Reality differs.

For undersized allocations, i.e. of compressed data, it is possible
to see P-sizes not divisible by 4 (disks) in 4KB sectors, however,
some sectors do apparently get wasted because the A-size in the DVA
is divisible by 6*4KB. With columnar allocation of disks, it is
easier to see why full stripes have to be used:

p1 p2 d1 d2 d3 d4
.  ,  1  5  9   13
.  ,  2  6  10  14
.  ,  3  7  11  x
.  ,  4  8  12  x

In this illustration a 14-sector-long block is saved, with X being
the empty leftovers, on which we can't really save (as would be the
case with the other allocation, which is likely less efficient for
CPU and IOs).

The metadata blocks do have A-sizes of 0x3000 (2 parity + 1 data),
at least - which on 4KB-sectored disks is still quite a lot for such
miniature data objects, but not as sad as 6*4KB would have been ;)

It also seems that the instinctive desire to have raidzN sets of
4*M+N disks (i.e. 6-disk raidz2, 11-disk raidz3, etc.) which was
discussed over and over on the list a couple of years ago, may
still be valid with typical block sizes being powers of two...
Even though gurus said that this should not matter much.
For IOPS - maybe not. For wasted space - likely...



I'm almost ready to go and test Q2 and Q3, however, the questions
which regard useable tools (and what data should be fed into such
tools?) are still on the table.


 Some OLD questions remain raised, just in case anyone answers them.

 3b) The redundancy algos should in fact cover other redundancy disks
 too (in order to sustain loss of any 2 disks), correct? (...)

 4) Where are the redundancy algorithms specified? Is there any simple
 tool that would recombine a given algo-N redundancy sector with
 some other 4 sectors from a 6-sector stripe in order to try and
 recalculate the sixth sector's contents? (Perhaps part of some
 unit tests?)

 7) Is there a command-line tool to do lzjb compressions

Re: [zfs-discuss] zpool rpool rename offline

2012-12-03 Thread Jim Klimov

On 2012-12-03 01:15, Phillip Wagstrom wrote:

You can't change the name of a zpool without importing it.

For what you're attempting to do, why not attach a larger vdisk and mirror the 
existing disk in rpool?  Then drop the smaller vdisk and you'll have a larger 
rpool.



In general, I'd do the renaming with a different bootable media,
including a LiveCD/LiveUSB, another distro that can import and
rename this pool version, etc. - as long as booting does not
involve use of the old rpool.

Phillip however has a good point about mirroring onto a larger
disk. This should also carry over your old pool's attributes
(bootfs, name, etc.) - however you will likely have to run
installgrub on the new disk image. When you detach the old
mirror half, you'd automatically have a larger pool on the
remaining disk image.
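
Roughly, with invented device names - c0t0d0s0 being the old small
disk and c0t1d0s0 the larger new one:

# zpool attach rpool c0t0d0s0 c0t1d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0
  (wait for the resilver to complete - zpool status rpool)
# zpool detach rpool c0t0d0s0
# zpool set autoexpand=on rpool
  (or: zpool online -e rpool c0t1d0s0)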


dom0 # xm list $vm -l | egrep 'vbd|:disk|zvol'
(vbd
(dev xvda:disk)
(uname phy:/dev/zvol/dsk/rpool/zvol/domu-2-root)
(vbd
(dev xvdb:disk)
(uname phy:/dev/zvol/dsk/rpool/zvol/domu-21-root)



By far, the easiest approach in your case would be to just
increase the host's zfs volume which backs your old rpool
and use autoexpansion (or manual expansion) to let your VM's
rpool capture the whole increased virtual disk.
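
A sketch, reusing the zvol name from your listing; the device name
seen inside the domU will differ on your system:

dom0 # zfs set volsize=40g rpool/zvol/domu-2-root
domU # zpool set autoexpand=on rpool
domU # zpool online -e rpool c0d0s0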

If automagic doesn't work, I posted about a month ago about
the manual procedure on this list:

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-November/052712.html

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Eradicating full-disk pool headers safely

2012-12-03 Thread Jim Klimov

Hello all,

  When I started with my old test box (the 6-disk raidz2 pool), I had
first created the pool on partitions (i.e. c7t1d0p0 or physical paths
like /pci@0,0/pci1043,81ec@1f,2/disk@1,0:q), but I've soon destroyed
it and recreated the pool (with the same name) on slices (i.e. c7t0d0s0
or /pci@0,0/pci1043,81ec@1f,2/disk@1,0:a) with a trailing 8MB slice
(whole-disk ZFS layout). The disks currently carry EFI labels, and
the zpool command finds the correct pool by name.

  However, whenever I use zdb, it finds leftovers of my original test
as labels number 2 and 3 (numbers 0 and 1 fail to unpack), so
zdb refuses to use my pool by name and I have to provide the GUID.
Is it easy to find out at which locations zdb finds these labels, so I
could zero them out and let zdb use the correct pool by name?

  Should I assume that p0 addresses the whole disk, and wipe the last
512K of the disk (which are now in the reserved 8MB partition)?

  BTW, what role does this 8Mb piece play? I might guess it helps to
replace disks by new ones with similar (not exact) sizes and this
slice on the new disk would shrink or expand to cover up the HDD size
discrepancy. But I haven't done any replacements so far which would
prove or disprove this  ;)

Thanks,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Digging in the bowels of ZFS

2012-12-03 Thread Jim Klimov

On 2012-12-02 05:42, Jim Klimov wrote:

So... here are some applied questions:


Well, I am ready to reply a few of my own questions now :)

I've staged an experiment by taking a 128Kb block from that file
and appending it to a new file in a test dataset, where I changed
the compression settings between the appendages. Thus I've got
a ZDB dump of three blocks with identical logical userdata and
different physical data.

# zdb -d -bb -e 1601233584937321596/test3 8 > /pool/test3/a.zdb
...
Indirect blocks:
   0 L1  DVA[0]=0:59492a98000:3000 DVA[1]=
0:83e2f65000:3000 [L1 ZFS plain file] sha256 lzjb LE contiguous
unique double size=4000L/400P birth=326381727L/326381727P fill=3
cksum=2ebbfb189e7ce003:166a23fd39d583ed:f527884977645395:896a967526ea9cea

   0  L0 DVA[0]=0:590002c1000:30000 [L0 ZFS plain file]
sha256 uncompressed LE contiguous unique single size=20000L/20000P
birth=326381721L/326381721P fill=1
cksum=3c691e8fc86de2ea:90a0b76f0d1fe3ff:46e055c32dfd116d:f2af276f0a6a96b9

   20000  L0 DVA[0]=0:594928b8000:9000 [L0 ZFS plain file]
sha256 lzjb LE contiguous unique single size=20000L/4800P
birth=326381724L/326381724P fill=1
cksum=57164faa0c1cbef4:23348aa9722f47d3:3b1b480dc731610b:7f62fce0cc18876f

   40000  L0 DVA[0]=0:59492a92000:6000 [L0 ZFS plain file]
sha256 gzip-9 LE contiguous unique single size=20000L/2800P
birth=326381727L/326381727P fill=1
cksum=d68246ee846944c6:70e28f6c52e0c6ba:ea8f94fc93f8dbfd:c22ad491c1e78530

segment [0000000000000000, 0000000000080000) size  512K


1)   So... how DO I properly interpret this to select sector ranges to
 DD into my test area from each of the 6 disks in the raidz2 set?

 On one hand, the DVA states the block length is 0x9000, and this
 matches the offsets of neighboring blocks.

 On the other hand, compressed physical data size is 0x4c00 for
 this block, and ranges 0x4800-0x5000 for other blocks of the file.
 Even multiplied by 1.5 (for raidz2) this is about 0x7000 and way
 smaller than 0x9000. For uncompressed files I think I saw entries
 like size=20000L/30000P, so I'm not sure even my multiplication
 by 1.5x above is valid, and the discrepancy between the DVA size /
 block interval and the physical allocation size reaches about 2x.

Apparently, my memory failed me. The values in size field regard the
userdata (compressed, non-redundant). Also I forgot to consider that
this pool uses 4KB sectors (ashift=12).

So my userdata which takes up about 0x4800 bytes would require 4.5
(rather, 5 whole) sectors and this warrants 4 sectors of the raidz2
redundancy on a 6-disk set - 2 sectors for the first 4 data sectors,
and 2 sectors for the remaining half-sector's worth of data. This
does sum up to 9*0x1000 bytes in whole-sector counting (as in offsets).

However, the gzip-compressed block above which only has 0x2800 bytes
of userdata and requires 3 sectors plus 2 redundancy sectors, still
has a DVA size of six 4KB sectors (0x6000). This is strange to me -
I'd expect 5 sectors for this block altogether... does anyone have
an explanation? Also, what should the extra userdata sector contain
physically - zeroes?

 5) Is there any magic to the checksum algorithms? I.e. if I pass
 some 128KB block's logical (userdata) contents to the command-line
 sha256 or openssl sha256 - should I get the same checksum as
 ZFS provides and uses?

The original 128KB file's sha256 checksum matches the uncompressed
block's ZFS checksum, so in my further tests I can use the command
line tools to verify the recombined results:

# sha256sum /tmp/b128
3c691e8fc86de2ea90a0b76f0d1fe3ff46e055c32dfd116df2af276f0a6a96b9  /tmp/b128

No magic, as long as there are useable command-line implementations
of the needed algos (sha256sum is there, fletcher[24] are not).

 6) What exactly does a checksum apply to - the 128Kb userdata block
 or a 15-20Kb (lzjb-)compressed portion of data? I am sure it's the
 latter, but ask just in case I don't miss anything... :)

ZFS parent block checksum applies to the on-disk variant of userdata
payload (compression included, redundancy excluded).



NEW QUESTIONS:

7) Is there a command-line tool to do lzjb compressions and
decompressions (in the same blocky manner as would be applicable
to ZFS compression)?

I've also tried to gzip-compress the original 128KB file, but
none of the compressed results (with varying gzip level) yielded
the same checksum that would match the ZFS block's one.
Zero-padding to 10240 bytes (psize=0x2800) did not help.


8) When should the decompression stop - as soon as it has extracted
the logical-size number of bytes (i.e. 0x20000)?


9) Physical sizes magically are in whole 512b units, so it seems...
I doubt that the compressed data would always end at such boundary.

How many bytes should be covered by a checksum?
Are the 512b blocks involved zero-padded at ends (on disk and/or RAM)?



Some OLD questions remain raised, just in case

Re: [zfs-discuss] zpool rpool rename offline

2012-12-03 Thread Jim Klimov

On 2012-12-03 20:35, Heiko L. wrote:

I've already tested:
beadm create -p $dstpool $bename
beadm list
zpool set bootfs=$dstpool/ROOT/$bename $dstpool
beadm activate $bename
beadm list
init 6

- result:
root@opensolaris:~# init 6
updating //platform/i86pc/boot_archive
updating //platform/i86pc/amd64/boot_archive

Hostname: opensolaris
WARNING: pool 'rpool1' could not be loaded as it was last accessed by another 
system (host: opensolaris hostid: 0xc08358). See:
http://www.sun.com/msg/ZFS-8000-EY
...hang...

   - seen to be a bug...


You wrote that you use opensolaris - if literally true, this is quite
old, and few people could say definitely which bugs to expect in which
version.
I might guess (hope) that your ultimate goal in increasing the disk is
to upgrade the VM to a more current build, like OI or Sol11?

Still, while booted from the old rpool, after activating the new one,
you could also zpool export rpool1 in order to mark it as cleanly
exported and not potentially held by another OS instance. This should
allow booting from it, unless some other bug steps in...

//Jim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool rpool rename offline

2012-12-03 Thread Jim Klimov

On 2012-12-03 20:51, Heiko L. wrote:

jimklimov wrote:


In general, I'd do the renaming with a different bootable media,
including a LiveCD/LiveUSB, another distro that can import and
rename this pool version, etc. - as long as booting does not
involve use of the old rpool.

Thank you. I will test it in the coming days.


Well then, hopefully this (other media to boot) will help with that
(forcing the old rpool to expand)...


If automagic doesn't work, I posted about a month ago about
the manual procedure on this list:

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-November/052712.html

procedure work on 2. disk,
but i cannot use zpool import on rpool...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 6Tb Database with ZFS

2012-12-01 Thread Jim Klimov

On 2012-12-01 15:05, Fung Zheng wrote:

Hello,

Im about to migrate a 6Tb database from Veritas Volume Manager to ZFS, I
want to set arc_max parameter so ZFS cant use all my system's memory,
but i dont know how much i should set, do you think 24Gb will be enough
for a 6Tb database? obviously the more the better but i cant set too
much memory. Have someone implemented succesfully something similar?


Not claiming to be an expert fully ready to (mis)lead you (and I
haven't done similar quests for databases), I might suggest that
you set the ZFS dataset option primarycache=metadata on your
dataset which holds the database. (PS: what OS version are you on?)

The general consent is that serious apps like databases are better
than generic OS/FS caches at caching what the DBMS deems fit (and
the data blocks might get cached twice - in ARC and in app cache),
however having ZFS *metadata* cached should speed up your HDD IO -
the server might keep {much of} the needed block map in RAM and not
have to start by fetching it from disks every time.

Also make sure to set the recordsize attribute as appropriate for
your DB software - to match the DB block size. Usually this ranges
around 4, 8 or 16Kb (with zfs default being 128Kb for filesystem
datasets). You might also want to put non-tablespace files (logs,
indexes, etc.) into separate datasets with their appropriate record
sizes - this would let you play with different caching and compression
settings, if applicable (you might save some IOPS by reading and
writing less mechanical data at a small hit to CPU horsepower by
using LZJB).
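
For example (names and sizes invented - match recordsize to your
actual DB block size before loading the data):

# zfs create -o recordsize=8k -o primarycache=metadata tank/oradata
# zfs create -o recordsize=128k -o compression=lzjb tank/oralogs
# zfs get recordsize,primarycache,compression tank/oradata tank/oralogs

And to cap the ARC at 24GB on Solaris-derived systems, something like
this in /etc/system (takes effect after a reboot):

set zfs:zfs_arc_max = 0x600000000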

Also such systems tend to benefit from SSD L2ARC read-caches and
SSD SLOG (ZIL) write-caches. These are different pieces of equipment
with distinct characteristics (SLOG is mirrored, small, write-mostly,
and should endure write-wear and survive sudden poweroffs; L2ARC is
big, fast for small random reads, moderately reliable).

If you do use a big L2ARC, you might indeed want to have both ZFS
caches for frequently accessed datasets (i.e. index) to hold both
the userdata and metadata (as is the default), while the randomly
accessed tablespaces might be or not be good candidates for such
caching - however you can test this setting change on the fly.
I believe, you must allow caching userdata for a dataset in RAM
if you want to let it spill over onto L2ARC.

HTH,
//Jim Klimov




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Digging in the bowels of ZFS

2012-12-01 Thread Jim Klimov
 such combo and ZFS
does what it should exhaustively and correctly, indeed ;)

Thanks a lot in advance for any info, ideas, insights,
and just for reading this long post to the end ;)
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove disk

2012-11-30 Thread Jim Klimov

On 2012-11-30 15:52, Tomas Forsman wrote:

On 30 November, 2012 - Albert Shih sent me these 0,8K bytes:


Hi all,

I would like to knwon if with ZFS it's possible to do something like that :

http://tldp.org/HOWTO/LVM-HOWTO/removeadisk.html


Removing a disk - no, one still cannot reduce the number of devices
in a zfs pool nor change raidzN redundancy levels (you can change
single disks to mirrors and back), nor reduce disk size.

As Tomas wrote, you can increase the disk size by replacing smaller
ones with bigger ones.

With sufficiently small starting disks and big new disks (i.e. moving
up from 1-2TB to 4TB) you can cheat by putting several partitions
on one drive and giving them to different pool components - if your
goal is to reduce the number of hardware disks in the pool.

However, note that:

1) A single HDD becomes a SPOF, so you should put pieces of different
raidz sets onto particular disks - if a HDD dies, it does not bring
down a critical amount of pool components and does not kill the pool.

2) The disk mechanics will be torn between many requests to your
pool's top-level VDEVs, probably greatly reducing achievable IOPS
(since the TLVDEVs are accessed in parallel).

So while possible, this cheat is useful as a temporary measure -
i.e. while you migrate data and don't have enough drive bays to
hold the old and new disks, and want to be on the safe side by not
*removing* a good disk in order to replace it with a bigger one.
With this cheat you have all data safely redundantly stored on
disks at all time during migration. In the end this disk can be
the last piece of the puzzle in your migration.



meaning :

I have a zpool with 48 disks with 4 raidz2 (12 disk). Inside those 48 disk
I've 36x 3T and 12 x 2T.
Can I buy new 12x4 To disk put in the server, add in the zpool, ask zpool
to migrate all data on those 12 old disk on the new and remove those old
disk ?


You pull out one 2T, put in a 4T, wait for resilver (possibly tell it to
replace, if you don't have autoreplace on)
Repeat until done.
If you have the physical space, you can first put in a new disk, tell it
to replace and then remove the old.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS QoS and priorities

2012-11-29 Thread Jim Klimov

I've heard a claim that ZFS relies too much on RAM caching, but
implements no sort of priorities (indeed, I've seen no knobs to
tune those) - so that if the storage box receives many different
types of IO requests with different administrative weights in
the view of admins, it can not really throttle some IOs to boost
others, when such IOs have to hit the pool's spindles.

For example, I might want to have corporate webshop-related
databases and appservers to be the fastest storage citizens,
then some corporate CRM and email, then various lower priority
zones and VMs, and at the bottom of the list - backups.

AFAIK, now such requests would hit the ARC, then the disks if
needed - in no particular order. Well, can the order be made
particular with current ZFS architecture, i.e. by setting
some datasets to have a certain NICEness or another priority
mechanism?

Thanks for info/ideas,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools

2012-11-28 Thread Jim Klimov

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

There are very few situations where (gzip) option is better than the
default lzjb.


Well, for the most part my question regarded the slowness (or lack of)
gzip DEcompression as compared to lz* algorithms. If there are files
and data like the OS (LZ/GZ) image and program binaries, which are
written once but read many times, I don't really care how expensive
it is to write less data (and for an OI installation the difference
between lzjb and gzip-9 compression of /usr can be around or over
100Mb's) - as long as I keep less data on-disk and have less IOs to
read in the OS during boot and work. Especially so, if - and this is
the part I am not certain about - it is roughly as cheap to READ the
gzip-9 datasets as it is to read lzjb (in terms of CPU decompression).
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools

2012-11-27 Thread Jim Klimov

Performance-wise, I think you should go for mirrors/raid10, and
separate the pools (i.e. rpool mirror on SSD and data mirror on
HDDs). If you have 4 SSDs, you might mirror the other couple for
zoneroots or some databases in datasets delegated into zones,
for example. Don't use dedup. Carve out some space for L2ARC.
As Ed noted, you might not want to dedicate much disk space due
to remaining RAM pressure when using the cache; however, spreading
the IO load between smaller cache partitions/slices on each SSD
may help your IOPS on average. Maybe go for compression.

I really hope someone better versed in compression - like Saso -
would chime in to say whether gzip-9 vs. lzjb (or lz4) sucks in
terms of read-speeds from the pools. My HDD-based assumption is
in general that the less data you read (or write) on platters -
the better, and the spare CPU cycles can usually take the hit.

I'd spread out the different data types (i.e. WORM programs,
WORM-append logs and random-io other application data) into
various datasets with different settings, backed by different
storage - since you have the luxury.

Many best practice documents (and original Sol10/SXCE/LiveUpgrade
requirements) place the zoneroots on the same rpool so they can
be upgraded seamlessly as part of the OS image. However you can
also delegate ZFS datasets into zones and/or have lofs mounts
from GZ to LZ (maybe needed for shared datasets like distros
and homes - and faster/more robust than NFS from GZ to LZ).
For OS images (zoneroots) I'd use gzip-9 or better (likely lz4
when it gets integrated), same for logfile datasets, and lzjb,
zle or none for the random-io datasets. For structured things
like databases I also research the block IO size and use that
(at dataset creation time) to reduce extra work with ZFS COW
during writes - at expense of more metadata.

You'll likely benefit from having OS images on SSDs, logs on
HDDs (including logs from the GZ and LZ OSes, to reduce needless
writes on the SSDs), and databases on SSDs. Things depend for
other data types, and in general would be helped by L2ARC on
the SSDs.

Also note that much of the default OS image is not really used
(i.e. X11 on headless boxes), so you might want to do weird
things with GZ or LZ rootfs data layouts - note that these might
puzzle your beadm/liveupgrade software, so you'll have to do
any upgrades with lots of manual labor :)

On a somewhat orthogonal route, I'd start with setting up a
generic dummy zone, perhaps with much unneeded software,
and zfs-cloning that to spawn application zones. This way
you only pay the footprint price once, at least until you
have to upgrade the LZ OSes - in that case it might be cheaper
(in terms of storage at least) to upgrade the dummy, clone it
again, and port the LZ's customizations (installed software)
by finding the differences between the old dummy and current
zone state (zfs diff, rsync -cn, etc.) In such upgrades you're
really well served by storing volatile data in separate datasets
from the zone OS root - you just reattach these datasets to the
upgraded OS image and go on serving.
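
The storage side of that is just snapshot + clone (dataset names
invented):

# zfs snapshot datapool/zones/dummy@golden
# zfs clone datapool/zones/dummy@golden datapool/zones/app1

The new zone is then configured with its zonepath on the cloned
dataset; the zone-tooling details vary between releases.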

As a particular example of the thing often upgraded and taking
considerable disk space per copy - I'd have the current JDK
installed in GZ: either simply lofs-mounted from GZ to LZs,
or in a separate dataset, cloned and delegated into LZs (if
JDK customizations are further needed by some - but not all -
local zones, i.e. timezone updates, trusted CA certs, etc.).

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools

2012-11-27 Thread Jim Klimov

Now that I thought of it some more, a follow-up is due on my advices:

1) While the best practices do(did) dictate to set up zoneroots in
   rpool, this is certainly not required - and I maintain lots of
   systems which store zones in separate data pools. This minimizes
   write-impact on rpools and gives the fuzzy feeling of keeping
   the systems safer from unmountable or overfilled roots.

2) Whether LZs and GZs are in the same rpool for you, or you stack
   tens of your LZ roots in a separate pool, they do in fact offer
   a nice target for dedup - with expected large dedup ratio which
   would outweigh both the overheads and IO lags (especially if it
   is on SSD pool) and the inconveniences of my approach with cloned
   dummy zones - especially upgrades thereof. Just remember to use
   the same compression settings (or lack of compression) on all
   zoneroots, so that the zfs blocks for OS image files would be
the same and dedupable.
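
I.e., with invented names - simulate first, then enable if the ratio
justifies it:

# zdb -S datapool
  (prints a simulated dedup table histogram and the expected ratio)
# zfs set dedup=on datapool/zones
# zfs set compression=gzip-9 datapool/zones
  (children inherit both settings, so all zoneroots stay uniform)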

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Directory is not accessible

2012-11-26 Thread Jim Klimov

On 2012-11-26 15:15, The OP wrote:

How can one remove a directory containing corrupt files or a corrupt file
itself? For me rm just gives input/output error.


I believe you can get rid of the corrupt files by overwriting them.
In my case of corrupted files, I dd'ed the corrupt blocks from a backup
source into the right spot of the file. Overall this released the corrupt
blocks from the pool and allowed them to get freed (or perhaps leaked in
case of that bug I've stepped onto).
Trying to free the block can get your pool into trouble or panics,
depending on the nature of the corruption, though (in my case, DDT
was trying to release a block that was not entered into the DDT).
If this happens, your next best bet would be to trace where the
error happens, invent a patch (such as letting it possibly leak
away) and compile your own kernel to clean up the pool.

Of course, it is also possible that the block would go away (if it
is not referenced also by snapshots/clones/dedup), and such drastic
measures won't be needed.
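
Something along these lines, with invented names - N stands for the
index of the damaged record, assuming a 128KB recordsize, and
conv=notrunc keeps dd from truncating the target file:

# dd if=/backup/copy/of/file of=/tank/data/damaged_file \
    bs=128k skip=N seek=N count=1 conv=notrunc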

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Appliance as a general-purpose server question

2012-11-22 Thread Jim Klimov

A customer is looking to replace or augment their Sun Thumper
with a ZFS appliance like 7320. However, the Thumper was used
not only as a protocol storage server (home dirs, files, backups
over NFS/CIFS/Rsync), but also as a general-purpose server with
unpredictably-big-data programs running directly on it (such as
corporate databases, Alfresco for intellectual document storage,
etc.) in order to avoid the networking transfer of such data
between pure-storage and compute nodes - this networking was
seen as both a bottleneck and a possible point of failure.

Is it possible to use the ZFS Storage appliances in a similar
way, and fire up a Solaris zone (or a few) directly on the box
for general-purpose software; or to shell-script administrative
tasks such as the backup archive management in the global zone
(if that concept still applies) as is done on their current
Solaris-based box?

Is it possible to run VirtualBoxes in the ZFS-SA OS, dare I ask? ;)

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Appliance as a general-purpose server question

2012-11-22 Thread Jim Klimov

On 2012-11-22 17:31, Darren J Moffat wrote:

Is it possible to use the ZFS Storage appliances in a similar
way, and fire up a Solaris zone (or a few) directly on the box
for general-purpose software; or to shell-script administrative
tasks such as the backup archive management in the global zone
(if that concept still applies) as is done on their current
Solaris-based box?


No it is a true appliance, it might look like it has Solaris underneath
but it is just based on Solaris.

You can script administrative tasks but not using bash/ksh style
scripting you use the ZFSSA's own scripting language.


So, the only supported (or even possible) way is indeed to use it
as a NAS for file or block IO from another head running the database
or application servers?..

In the Datasheet I read that Cloning and Remote replication are
separately licensed features; does this mean that the capability
for zfs send|zfs recv backups from remote Solaris systems should
be purchased separately? :(

I wonder if it would make weird sense to get the boxes, forfeit the
cool-looking Fishworks, and install Solaris/OI/Nexenta/whatever to
get the most flexibility and bang for a buck from the owned hardware...
Or, rather, shop for the equivalent non-appliance servers...

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mixing WD20EFRX and WD2002FYPS in one pool

2012-11-21 Thread Jim Klimov

On 2012-11-21 16:45, Eugen Leitl wrote:

Thanks, this is great to know. The box will be headless, and
run in text-only mode. I have an Intel NIC in there, and don't
intend to use the Realtek port for anything serious.


My laptop based on AMD E2 VISION integrated CPU and Realtek Gigabit
had intermittent problems with rge driver (intr count went to
about 100k/sec and X11 locked up until I disconnected the LAN),
but these diminished or disappeared after I switched to gani
driver (source available from internet).

OI lacks support for the Radeon chips in my CPU (works as vesavga).
And USB3.


I intend to boot off USB flash stick, and runn OI with napp-it.
8 GByte RAM, unfortunately not ECC, but it will do for a secondary
SOHO NAS, as data is largely read-only.


Theoretically, if memory has a hiccup while scrub verifies your
disks, it can cause phantom checksum mismatches to be detected.
I am not sure about timing of reads and other events involved
in further reconstitution of the data - whether the recovery
attempt will use the re-read (and possibly correct) sector data
or if it will continue based on invalid buffer contents.

I guess ZFS, to be on the safe side, should double-check the found
discrepancies and the sectors it's going to use to recover a
block, at least if the kernel knows it is on non-ECC RAM (if it
does) - but I don't know if it really does that. (Worthy RFE if not.)

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel DC S3700

2012-11-21 Thread Jim Klimov

On 2012-11-21 21:55, Ian Collins wrote:

I can't help thinking these drives would be overkill for an ARC device.
All of the expensive controller hardware is geared to boosting random
write IOPs, which somewhat wasted on a write slowly, read often device.
The enhancements would be good for a ZIL, but the smallest drive is at
least an order of magnitude too big...


I think, given the write-endurance and powerloss protection, these
devices might make for good pool devices - whether for an SSD-only
pool, or for an rpool+zil(s) mirrors with main pools (and likely
L2ARCs, yes) being on different types of devices.

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Jim Klimov

On 2012-11-21 03:21, nathan wrote:

Overall, the pain of the doubling of bandwidth requirements seems like a
big downer for *my* configuration, as I have just the one SSD, but I'll
persist and see what I can get out of it.


I might also speculate that for each rewritten block of userdata in
the VM image, you have a series of metadata block updates in ZFS.
If you keep the zvol blocks relatively small, you might get the
effective doubling of writes for the userdata updates.

As for ZIL - even if it is used with the in-pool variant, I don't
think your setup needs any extra steps to disable it (as Edward likes
to suggest), and most other setups don't need to disable it either.
It also shouldn't add much to your writes - the in-pool ZIL blocks
are then referenced as userdata when the TXG commit happens (I think).

I also think that with a VM in a raw partition you don't get any
snapshots - neither ZFS as underlying storage ('cause it's not),
nor hypervisor snaps of the VM. So while faster, this is also some
trade-off :)

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repairing corrupted ZFS pool

2012-11-19 Thread Jim Klimov

On 2012-11-19 20:28, Peter Jeremy wrote:

Yep - that's the fallback solution.  With 1874 snapshots spread over 54
filesystems (including a couple of clones), that's a major undertaking.
(And it loses timestamp information).


Well, as long as you have and know the base snapshots for the clones,
you can recreate them at the same branching point on the new copy too.

Remember to use something like rsync -cavPHK --delete-after --inplace 
src/ dst/ to do the copy, so that the files removed from the source

snapshot are removed on target, the changes are detected thanks to
file checksum verification (not only size and timestamp), and changes
take place within the target's copy of the file (not as rsync's default
copy-and-rewrite) in order for the retained snapshots history to remain
sensible and space-saving.

Also, while you are at it, you can use different settings on the new
pool, based on your achieved knowledge of your data - perhaps using
better compression (IMHO stale old data that became mostly read-only
is a good candidate for gzip-9), setting proper block sizes for files
of databases and disk images, maybe setting better checksums, and if
your RAM vastness and data similarity permit - perhaps employing dedup
(run zdb -S on source pool to simulate dedup and see if you get any
better than 3x savings - then it may become worthwhile).

But, yes, this will take quite a while to effectively walk your pool
several thousand times, if you do the plain rsync from each snapdir.
Perhaps, if the zfs diff does perform reasonably for you, you can
feed its output as the list of objects to replicate in rsync's input
and save many cycles this way.
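
A very rough sketch of that idea - mountpoint names invented, and
zfs diff also reports renames and removals which need extra handling,
so treat this only as a starting point:

# zfs diff tank/fs@old tank/fs@new | cut -f2 | sed 's|^/tank/fs/||' \
    | rsync -aH --files-from=- /tank/fs/ /newpool/fs/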

Good luck,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repairing corrupted ZFS pool

2012-11-19 Thread Jim Klimov

On 2012-11-19 20:58, Mark Shellenbaum wrote:

There is probably nothing wrong with the snapshots.  This is a bug in
ZFS diff.  The ZPL parent pointer is only guaranteed to be correct for
directory objects.  What you probably have is a file that was hard
linked multiple times and the parent pointer (i.e. directory) was
recycled and is now a file


Interesting... do the ZPL files in ZFS keep pointers to parents?

How in the COW transactiveness could the parent directory be
removed, and not the pointer to it from the files inside it?
Is this possible in current ZFS, or could this be a leftover
in the pool from its history with older releases?

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repairing corrupted ZFS pool

2012-11-19 Thread Jim Klimov

Oh, and one more thing: rsync is only good if your filesystems don't
really rely on ZFS/NFSv4-style ACLs. If you need those, you are stuck
with Solaris tar or Solaris cpio to carry the files over, or you have
to script up replication of ACLs after rsync somehow.

You should also replicate the local zfs attributes of your datasets,
zfs allow permissions, and ACLs on .zfs/shares/* (if any, for CIFS) -
at least for their currently relevant live copies, which is not
fatally difficult to script (I don't know if it is possible to fetch
the older attribute values from snapshots - the ones which were in
force at that past moment of time; if somebody knows anything about
this - plz write).
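
A starting point for capturing the current values (dataset name
invented):

# zfs get -s local -H -o name,property,value all tank/fs
# zfs allow tank/fs
# /usr/bin/ls -dV /tank/fs/.zfs/shares/* 2>/dev/null

Replaying these on the target side still has to be scripted by hand.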

On another note, to speed up the rsyncs, you can try to save on the
encryption (if you do this within a trusted LAN) - use rsh, or ssh
with arcfour or none enc. algos, or perhaps rsync over NFS as
if you are in the local filesystem.

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repairing corrupted ZFS pool

2012-11-19 Thread Jim Klimov

On 2012-11-19 22:38, Mark Shellenbaum wrote:

The parent pointer is a single 64 bit quantity that can't track all the
possible parents a hard linked file could have.


I believe it is the inode (object) number of the parent, or similar -
and a freed object number can get recycled and used by newer objects?


Now when the original dir.2 object number is recycled you could have a
situation where the parent pointer for points to a non-directory.

The ZPL never uses the parent pointer internally.  It is only used by
zfs diff and other utility code to translate object numbers to full
pathnames.  The ZPL has always set the parent pointer, but it is more
for debugging purposes.


Thanks, very interesting!

Now that this value is used and somewhat exposed to users, isn't it
time to replace it with an nvlist or a different object type that
would hold all such parent pointers for hardlinked files (switching
from a single integer to an nvlist whenever a file has more than one
link)? At least, it would make zfs diff more consistent and reliable,
though at the cost of some complexity... Inodes already track their
reference counts; if we keep track of one referrer explicitly, why
not track them all?

Thanks for info,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

2012-11-16 Thread Jim Klimov

On 2012-11-15 21:43, Geoff Nordli wrote:

Instead of using vdi, I use comstar targets and then use vbox built-in
scsi initiator.


Out of curiosity: in this case are there any devices whose ownership
might get similarly botched, or have you tested that this approach
also works well for non-root VMs?

Did you measure any overhead of initiator-target vs. a local zvol,
both being on the same system? Is there any significant performance
difference worth thinking and talking about?

Thanks,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

2012-11-16 Thread Jim Klimov

On 2012-11-16 12:43, Robert Milkowski wrote:

No, there isn't any other way to do it currently. The SMF approach is
probably the best option for the time being.

I think that there should be a couple of other properties for zvols
where permissions could be stated.


+1 :)
Well, when the subject was discussed a month ago, I posted a couple
of RFEs, lest the problem be quietly forgotten:

https://www.illumos.org/issues/3283
ZFS: correctly remember device node ownership and ACLs for ZVOLs

https://www.illumos.org/issues/3284
ACLs on device node can become applied to wrong devices; UID/GID not 
retained


While trying to find workarounds for Edward's problem, I discovered
that NFSv4/ZFS-style ACLs can be applied to /devices/* and are even
remembered across reboots, but in practice this causes more problems
than it solves.

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

2012-11-16 Thread Jim Klimov

Well, as a simple stone-age solution (to simplify your SMF approach),
you can define custom attributes on datasets, zvols included. A custom
attribute must include a colon ':' in its name, and values can be
multiline if needed. A simple example follows:

# zfs set owner:user=jim pool/rsvd
# zfs set owner:group=staff pool/rsvd
# zfs set owner:chmod=777 pool/rsvd
# zfs set owner:acl=`ls -vd .profile` pool/rsvd

# zfs get all pool/rsvd
...
pool/rsvd  owner:chmod  777                                          local
pool/rsvd  owner:acl    -rw-r--r--   1 root     root          54 Nov 11 22:21 .profile
     0:owner@:read_data/write_data/append_data/read_xattr/write_xattr
     /read_attributes/write_attributes/read_acl/write_acl/write_owner
     /synchronize:allow
     1:group@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow
     2:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
     :allow                                                           local
pool/rsvd  owner:group  staff                                         local
pool/rsvd  owner:user   jim                                           local

Then you can query the zvols for such attribute values and use them
in chmod, chown, ACL settings, etc. from your script. This way the
main goal is reached: the ownership config data stays within the pool.
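
For illustration, here is an untested sketch of what the applying
side could look like - the owner:* attribute names are just the ones
invented above, and error handling is omitted:

#!/bin/sh
# Walk all zvols; if the custom owner:* attributes are set,
# apply them to the corresponding device nodes.
for ds in `zfs list -H -o name -t volume`; do
    u=`zfs get -H -o value owner:user "$ds"`
    g=`zfs get -H -o value owner:group "$ds"`
    m=`zfs get -H -o value owner:chmod "$ds"`
    for dev in "/dev/zvol/dsk/$ds" "/dev/zvol/rdsk/$ds"; do
        [ "$u" != "-" ] && chown "$u" "$dev"
        [ "$g" != "-" ] && chgrp "$g" "$dev"
        [ "$m" != "-" ] && chmod "$m" "$dev"
    done
done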

HTH,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

2012-11-16 Thread Jim Klimov

On 2012-11-16 14:45, Jim Klimov wrote:

Well, as a simple stone-age solution (to simplify your SMF approach),
you can define custom attributes on datasets, zvols included. A custom
attribute must include a colon ':' in its name, and values can be
multiline if needed. A simple example follows:



Forgot to mention: to clear these custom values, you can just
zfs inherit them on this same dataset. As long as the parent
does not define them, they should just get wiped out.
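
For example (using the attribute names invented in the previous mail):

# zfs inherit owner:user pool/rsvd
# zfs inherit owner:group pool/rsvd
# zfs inherit owner:chmod pool/rsvd
# zfs inherit owner:acl pool/rsvd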

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel DC S3700

2012-11-14 Thread Jim Klimov

On 2012-11-14 18:05, Eric D. Mudama wrote:

On Wed, Nov 14 at  0:28, Jim Klimov wrote:

All in all, I can't come up with anything offensive against it quickly
;) One possible nit regards the ratings being geared towards 4KB block
(which is not unusual with SSDs), so it may be further from announced
performance with other block sizes - i.e. when caching ZFS metadata.


Would an ashift of 12 conceivably address that issue?



Performance-wise (and wear-wise) - probably. Gotta test how bad it is
at 512b IOs ;) Also I am not sure if ashift applies to (can be set for)
L2ARC cache devices...

Actually, if read performance does not happen to suck at smaller block
sizes, ashift is not needed - the L2ARC writes seem to be streamed
sequentially (as in an infinite tape) so smaller writes would still
coalesce into big HW writes and not cause excessive wear by banging
many random flash cells. IMHO :)

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel DC S3700

2012-11-13 Thread Jim Klimov

On 2012-11-13 22:56, Mauricio Tavares wrote:

Trying again:

Intel just released those drives. Any thoughts on how nicely they will
play in a zfs/hardware raid setup?


Seems interesting - fast, supposedly reliable and consistent in its
IOPS (according to the marketing talk), and it addresses power-loss
protection (according to the datasheet):

* Endurance Rating - 10 drive writes/day over 5 years while running
JESD218 standard

* The Intel SSD DC S3700 supports testing of the power loss capacitor,
which can be monitored using the following SMART attribute: (175, AFh).

Somewhat affordably priced (at least in the volume market for shops
that buy hardware in cubic meters ;)

http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/05/intel-announces-intel-ssd-dc-s3700-series--next-generation-data-center-solid-state-drive-ssd

http://download.intel.com/newsroom/kits/ssd/pdfs/Intel_SSD_DC_S3700_Product_Specification.pdf

All in all, I can't come up with anything offensive against it quickly
;) One possible nit regards the ratings being geared towards 4KB blocks
(which is not unusual with SSDs), so it may fall further from the
announced performance with other block sizes - i.e. when caching ZFS
metadata.

Thanks for bringing it into the spotlight; I hope the more savvy
posters will review it in more depth.

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?

2012-11-13 Thread Jim Klimov

On 2012-11-14 03:20, Dan Swartzendruber wrote:

Well, I think I give up for now.  I spent quite a few hours over the last
couple of days trying to get gnome desktop working on bare-metal OI,
followed by virtualbox.  Supposedly that works in headless mode with RDP for
management, but nothing but fail for me.  Found quite a few posts on various
forums of people complaining that RDP with external auth doesn't work (or
not reliably), and that was my experience.



I can't say I used VirtualBox RDP extensively, certainly not in the
newer 4.x series, yet. For my tasks it sufficed to switch the VM
from headless to GUI and back via savestate, as automated by my
script from vboxsvc (vbox.sh -s vmname startgui for a VM config'd
as a vboxsvc SMF service already).

 The final straw was when I rebooted the OI server as part of cleaning
 things up, and... It hung. Last line in verbose boot log is 'ucode0 is
 /pseudo/ucode@0'. I power-cycled it to no avail. Even tried a backup
 BE from hours earlier, to no avail. Likely whatever was bunged
 happened prior to that. If I could get something that ran like xen or
 kvm reliably for a headless setup, I'd be willing to give it a try,
 but for now, no...

I can't say much about OI desktop problems either - works for me
(along with VBox 4.2.0 release), suboptimally due to lack of drivers,
but reliably.

Try booting with the -k option to load the kmdb debugger as well -
maybe the system will drop into it when it gets stuck (it does so
instead of rebooting when panicking), and you can find some more
details there?..
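
For example, with the legacy GRUB that OI uses you can press 'e' on
the boot entry and append the flags to the kernel$ line - roughly
like this sketch (the exact line differs from system to system):

   kernel$ /platform/i86pc/kernel/amd64/unix -B $ZFS-BOOTFS -kv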

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Expanding a ZFS pool disk in Solaris 10 on VMWare (or other expandable storage technology)

2012-11-11 Thread Jim Klimov
%
  =========   ======   ============   =====   ===   ======   ===
      1       Active   Solaris2           1   2609     2609   100

   If you apply this technique to rpools, note that the partition
   type is different (SOLARIS vs EFI), the start cylinders differ
   (1 for MBR, 0 for EFI), and the bootable partition is Active.


6) The scary part is that I need to remove the partition and slice
   tables and recreate them starting at the same positions.

   So in fdisk I press '3' to delete partition 1, then I press '1'
   to create a new partition. If I select EFI, it automatically
   fills the disk from 0 to the end. An MBR-based (Solaris2)
   partition starts at 1 and asks me to enter the desired size.

   For the disk dedicated fully to a pool, I chose EFI as it was
   originally.

   Now I press '5' to save the new partition table and return to
   format. Entering 'p', 'p' (partition menu, then print) I see
   that the slice sizes remain as they were...

   Returning to the disk-level menu, I entered t for Type:

format> t
AVAILABLE DRIVE TYPES:
0. Auto configure
1. other
Specify disk type (enter its number)[1]: 0
c1t1d0: configured with capacity of 60.00GB
<VMware-Virtual disk-1.0-60.00GB>
selecting c1t1d0
[disk formatted]
/dev/dsk/c1t1d0s0 is part of active ZFS pool pool. Please see zpool(1M).

   I picked 0, et voila - the partition sizes are reassigned.

   Too early to celebrate however: the ZFS slice #0 now starts at a
   wrong position:

format> p
partition> p
Current partition table (default):
Total disk sectors available: 125812701 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                34      59.99GB         125812701
  1 unassigned    wm                 0           0                  0
  2 unassigned    wm                 0           0                  0
  3 unassigned    wm                 0           0                  0
  4 unassigned    wm                 0           0                  0
  5 unassigned    wm                 0           0                  0
  6 unassigned    wm                 0           0                  0
  8   reserved    wm         125812702      8.00MB          125829085

   Remembering that the original table started this slice at 256,
   and remembering the new table's last sector value, I mix the two:

partition> 0
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm                34      59.99GB         125812701

Enter partition id tag[usr]:
Enter partition permission flags[wm]:
Enter new starting Sector[34]: 256
Enter partition size[125812668b, 125812923e, 61431mb, 59gb, 0tb]: 125812701e

partition> p
Current partition table (unnamed):
Total disk sectors available: 125812701 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256      59.99GB         125812701

   Finally, I can save the changed tables and exit format:

partition> label
Ready to label disk, continue? y

partition> q
format> q

7) Inspecting the pool, and even exporting and importing it and
   inspecting again, I see that autoexpand did not take place and
   the pool is still 20GB in size (dunno why - sol10u10 bug? see
   also the note at the end) :(

   So I do the manual step:

# zpool online -e pool c1t1d0

   The -e flag marks the component as eligible for expansion.
   When all pieces of a top-level vdev become larger, the setting
   takes effect and the pool finally becomes larger:

# zpool list
NAME    SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
pool   59.9G   441M  59.4G     0%  ONLINE  -
rpool  19.9G  6.91G  13.0G    34%  ONLINE  -

   Now I can finally go to my primary quest and install that
   large piece of software into a zone that lives on pool! ;)
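
   A side note on the autoexpand behavior seen in step 7: the
   pool-level property can be checked and toggled as shown below
   (pool name as in this example), though I can't promise it would
   have avoided the manual zpool online -e step on this release:

# zpool get autoexpand pool
# zpool set autoexpand=on pool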

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot replace X with Y: devices have different sector alignment

2012-11-10 Thread Jim Klimov

On 2012-11-10 17:16, Jan Owoc wrote:

Any other ideas short of block pointer rewrite?


A few... one is an idea of what could be the cause: AFAIK the
ashift value is not so much per-pool as per-top-level-vdev.
If the pool started as a set of 512b drives and was later expanded
with sets of 4K drives, such a mixed ashift could happen (see the
zdb sketch below for one way to check)...
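
If memory serves, one way to check what each top-level vdev actually
uses is to dump the pool configuration with zdb and look at the
per-vdev ashift values - the pool name below is a placeholder:

# zdb -C poolname | grep ashift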

It might be possible to override the reported sector size via sd.conf
and fool the OS into using 512b sectors on a 4KB-native disk (this
is mostly used the other way around, though - to enforce 4KB sectors
on 4KB-native drives that emulate 512b sectors). This might work,
and earlier posters on the list saw no evidence that 512b emulation
is inherently evil and unreliable (modulo firmware/hardware errors
that can creep in anywhere anyway), but it would likely make the
disk slower on random writes.
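
For illustration only, the illumos-style sd.conf override looks
roughly like the line below - the vendor/product string is a
placeholder that must match your disk's inquiry data, and I have
not tested forcing the value downwards to 512:

   sd-config-list = "ATA     DISKMODEL1234", "physical-block-size:4096";

with physical-block-size set to whatever sector size you want the
OS to believe.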

Also, I am not sure how the 4KB-native HDD would process partial
overwrites of a 4KB sector with 512b pieces of data - would other
bytes remain intact or not?..

Before trying to fool a production system this way, if at all,
I believe some stress-tests with small blocks are due on some
other system.

My 2c,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?

2012-11-09 Thread Jim Klimov
On 2012-11-09 16:14, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: Karl Wagner [mailto:k...@mouse-hole.com]

If I was doing this now, I would probably use the ZFS aware OS bare metal,
but I still think I would use iSCSI to export the ZVols (mainly due to the 
ability
to use it across a real network, hence allowing guests to be migrated simply)


Yes, if your VM host is some system other than your ZFS baremetal storage 
server, then exporting the zvol via iscsi is a good choice, or exporting your 
storage via NFS.  Each one has their own pros/cons, and I would personally be 
biased in favor of iscsi.

But if you're going to run the guest VM on the same machine that is the ZFS 
storage server, there's no need for the iscsi.



Well, since the ease of re-attachment of VM hosts to iSCSI was mentioned
a few times in this thread (and there are particular nuances with iSCSI
to localhost), it is worth mentioning that NFS files can be re-attached
just as easily - including the localhost.

Cloning disks is just as easy when they are zvols or files in dedicated
datasets; note that disk image UUIDs must be re-forged anyway (see doc).

Also note that, in general, there might be a need for some fencing
(i.e. ensuring that only one host tries to start up a VM from a
particular backend image). I am not sure iSCSI inherently does a
better job at this than NFS?..

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?

2012-11-09 Thread Jim Klimov
On 2012-11-09 16:11, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

From: Dan Swartzendruber [mailto:dswa...@druber.com]

I have to admit Ned's (what do I call you?) idea is interesting.  I may give
it a try...


Yup, officially Edward, most people call me Ned.

I contributed to the OI VirtualBox instructions.  See here:
http://wiki.openindiana.org/oi/VirtualBox

Jim's vboxsvc is super powerful


Thanks for kudos, and I'd also welcome some on the SourceForge
project page :)

http://sourceforge.net/projects/vboxsvc/

 for now, if you find it confusing in any way, just ask for help here. 
 (Right Jim?)


I'd prefer the questions and discussion on vboxsvc to continue in
the VirtualBox forum, so it's all in one place for other users too.
It is certainly off-topic for a list about ZFS, so I won't take
this podium for too long :)

https://forums.virtualbox.org/viewtopic.php?f=11&t=33249

 One of these days I'm planning to contribute a Quick Start guide to 
vboxsvc,


I agree that the README might need cleaning up; so far it is like a
snowball, growing with details and new features. Perhaps some part
should be split out into a concise quick-start guide that would not
scare people off by the sheer amount of text ;)

I don't think I can point to a chapter and say "Take this as the
QuickStart" :(

 - But at first I found it overwhelming, mostly due to unfamiliarity 
with SMF.


The current README does, however, provide an overview of SMF as was
needed by some of the inquiring users, and an example on command-line
creation of a service to wrap a VM. A feature to do this by the script
itself is pending, somewhat indefinitely.



Also note that for OI desktop users in particular (and likely for
other OpenSolaris-based OSes with X11 too), I'm now adding features
to ease management of VMs that are not executed headless but rather
interactively. These too can now be wrapped as SMF services to
automate shutdown and/or backgrounding into headless mode and back.
I made and use this myself to enter other OSes on my laptop that
are dual-bootable and can run in VBox as well as on hardware.
There is also a new foregrounding startgui mode that traps the
signals which stop its terminal and properly savestates or shuts
down the VM; it also wraps the taking of ZFS snapshots for the VM
disk resources, if applicable. Another mode spawns a dedicated
xterm for the script's execution, so by closing the xterm you can
properly stop the VM with your preselected method in one click,
before you log out of the X11 session.

However, this part of my work was almost in vain - the end of an X11
session happens as a brute-force close of X connections, so the
interactive GUIs just die before they can process any signals.
This makes sense for networked X servers that can't really send
signals to remote client OSes, but is rather unfortunate for a local
OS. I hope the desktop environment gurus might come up with something.

Or perhaps I'll come up with an SMF wrapper for X sessions that the
vbox startgui feature could depend on, and the close of a session
would be an SMF disablement. Hopefully, spawned local X-clients would
also be under the SMF contract and would get chances to stop properly :)

Anyway, if anybody else is interested in the new features described
above - check out the code repository for the vboxsvc project (this
is not yet so finished as to publish a new package version):

http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/lib/svc/method/vbox.sh
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/var/svc/manifest/site/vbox-svc.xml
http://vboxsvc.svn.sourceforge.net/viewvc/vboxsvc/usr/share/doc/vboxsvc/README-vboxsvc.txt

See you in the VirtualBox forum thread if you do have questions :)
//Jim Klimov


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Forcing ZFS options

2012-11-09 Thread Jim Klimov

There are times when ZFS options cannot be applied at the moment -
e.g. changing the desired mountpoint of an active filesystem (or
setting a mountpoint onto a location that is currently not empty).

Such attempts now bail out with messages like:
cannot unmount '/var/adm': Device busy
cannot mount '/export': directory is not empty

and such.

Is it possible to force the new values to be saved into the ZFS
dataset properties anyway, so that they take effect upon the next
pool import or reboot?

I currently work around the harder of such situations with a reboot
into a different boot environment or even into a livecd/failsafe,
just so that the needed datasets or paths won't be busy and so I
can set, verify and apply these mountpoint values. This is not a
convenient way to do things :)

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Forcing ZFS options

2012-11-09 Thread Jim Klimov

On 2012-11-09 18:06, Gregg Wonderly wrote:
 Do you move the pools between machines, or just on the same physical 
machine?  Could you just use symlinks from the new root to the old root 
so that the names work until you can reboot?  It might be more practical 
to always use symlinks if you do a lot of moving things around, and then 
you wouldn't have to figure out how to do the reboot shuffle.  Instead, 
you could just shuffle the symlinks.


No, this concerns datasets within the machine. And symlinks often
don't cut it. For example, I recently needed to switch '/var' from
an automounted filesystem dataset to a legacy one mounted from
/etc/vfstab. I can't set the different mountpoint value (legacy)
while the OS is up and using '/var', and I don't seem to have a way
to do this automatically during reboot (short of crafting an SMF
script that fires early in the boot sequence - but that's a
workaround outside ZFS technology, as is using the livecd, a
failsafe boot or another BE).
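
For context, the end state I am after is roughly mountpoint=legacy
on the dataset plus a vfstab line like the one below (the dataset
name is just an example); it is getting there without a special
boot that is the problem:

rpool/ROOT/mybe/var   -   /var   zfs   -   yes   -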

A different example is that careless work with beadm sometimes
leaves the root dataset with a mountpoint attribute other than '/'.
While the proper rootfs is forced to mount at the root node anyway,
it is not clean to have the discrepancy. However, I cannot
successfully run zfs set mountpoint=/ rpool/ROOT/bename while
booted into this BE.

Forcing the attribute to store the value I need, so that it takes
effect after reboot - that's what I am asking for (if that was not
clear from my first post).


Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedicated server running ESXi with no RAID card, ZFS for storage?

2012-11-08 Thread Jim Klimov
On 2012-11-08 05:43, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) wrote:

you've got a Linux or Windows VM inside of ESX, which is writing to a virtual 
disk, which ESX is then wrapping up inside NFS and TCP, talking on the virtual 
LAN to the ZFS server, which unwraps the TCP and NFS, pushes it all through the 
ZFS/Zpool layer, writing back to the virtual disk that ESX gave it, which is 
itself a layer on top of Ext3


I think this is the part where you disagree. The way I understand
all-in-ones, the VM running a ZFS OS enjoys PCI pass-through, so it
gets dedicated hardware access to the HBA(s) and hard disks at raw
speed, with no extra layers of lag in between. So there are a couple
of OS disks where ESXi itself is installed (plus distros, logging and
such stuff), and the other disks are managed by ZFS in a VM and
served back to ESXi to store the other VMs on the system.

Also, VMWare does not (AFAIK) use ext3, but their own VMFS which is,
among other things, cluster-aware (same storage can be shared by
several VMware hosts).

That said, on older ESX (with its minimized RHEL userspace interface),
which was picky about only using certified hardware with virt-enabled
drivers, I did combine some disks served by the motherboard into a
Linux mdadm array (within the RHEL-based management OS) and exported
that to the vmkernel over NFS. Back then disk performance was abysmal
whatever you did, so the NFS disks ended up being used not for virtual
disks but rather for distros and backups.

HTH,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

