[zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Nathan Kroenert

 Hi folks,  (Long time no post...)

Only starting to get into this one, so apologies if I'm light on detail, 
but...


I have a shiny SSD I'm using to help make some VirtualBox stuff I'm 
doing go fast.


I have a 240GB Intel 520 series jobbie. Nice.

I chopped it into a few slices - p0 (partition table), p1 128GB, p2 60GB.

As part of my work, I have used partition 1 as a RAW device (cxtxdxp1), 
wrapped with a VirtualBox-created VMDK linkage, and it works like a 
champ. :) Very happy with that.


I then tried creating a new zpool using partition 2 of the disk (zpool 
create c2d0p2) and then carved a zvol out of that (30GB), and wrapped 
*that* in a vmdk.
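
(For reference, the sequence was along these lines - 'ssd' and 'vbox01' 
are placeholder names, not necessarily what I used:

   # zpool create ssd c2d0p2                  # pool name is a placeholder
   # zfs create -V 30G ssd/vbox01             # zvol name is a placeholder
   # VBoxManage internalcommands createrawvmdk -filename vbox01.vmdk \
       -rawdisk /dev/zvol/rdsk/ssd/vbox01
)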


Still works OK and speed is good(ish) - but there are a couple of things 
in particular that disturb me:
 - Sync writes are pretty slow - only about 1/10th of what I thought I 
might get (about 15MB/s). Async writes are fast - up to 150MB/s or more.
 - More worryingly, writes seem to be amplified by 2x, in that if 
I write 100MB at the guest level, the underlying bare-metal ZFS writes 
200MB, as observed by iostat. This doesn't happen on the VMs that are 
using RAW slices.
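
(For reference, I was comparing what the guest thought it wrote against 
what the host disks actually saw, with nothing fancier than:

   # zpool iostat -v 5
   # iostat -xnz 5
)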


Anyone have any thoughts on what might be happening here?

I can appreciate that if everything comes through as a sync write, it 
goes to the ZIL first, then to its final resting place - but it seems a 
little over the top that it really is double.


I have also had a play with sync=, primarycache and a few other 
settings, but it doesn't seem to change the behaviour.
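
(i.e. the usual property twiddling - dataset name is a placeholder again:

   # zfs set sync=disabled ssd/vbox01         # and back to sync=standard
   # zfs set primarycache=metadata ssd/vbox01
   # zfs get sync,primarycache ssd/vbox01
)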


Again - I'm looking for thoughts here, as I have only really just 
started looking into this. Should I happen across anything interesting, 
I'll follow up on this post.


Cheers,

Nathan. :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread nathan

Hi folks,

some extra thoughts:

1. Don't question why. :) I'm playing and observing, so I ultimately 
know and understand the best way to do things! heh.
2. In fairness, asking why is entirely valid. ;) I'm not doing things to 
best practice just yet - I wanted the best performance for my VMs, 
which are all testing/training/playing VMs. I got *great* performance 
from the first RAW PARTITION I gave to VirtualBox. I wanted to do the 
same again, but because of the way it wraps partitions - Solaris 
complains that there is more than one Solaris2 partition on the disk 
when I try to install the second instance - I thought I'd give zvols a go.
3. The device I wrap as a VMDK is the RAW device. sigh. Of course, all 
writes will go through the ZIL, and of course we'll have to write twice 
as much. I should have seen that straight away, but was lacking sleep.
4. Note: I don't have a separate ZIL. The first partition I made was 
given directly to virtualbox. The second was used to create the zpool.


I'm going to have a play with using LVM md devices instead and see how 
that goes as well.


Overall, the pain of the doubling of bandwidth requirements seems like a 
big downer for *my* configuration, as I have just the one SSD, but I'll 
persist and see what I can get out of it.


Thanks for the thoughts thus far!

Cheers,

Nathan.

On 21/11/2012 8:33 AM, Fajar A. Nugraha wrote:

On Wed, Nov 21, 2012 at 12:07 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris)
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

Why are you parititoning, then creating zpool,

The common case is that they use the disk for something
else as well (e.g. the OS), not only for zfs


and then creating zvol?

Because it enables you to do other stuff more easily and quickly (e.g.
copying files from the host) compared to using plain disk image files
(vmdk/vdi/vhd/whatever)


I think you should make the whole disk a zpool unto itself, and then carve out 
the 128G zvol and 60G zvol.  For that matter, why are you carving out multiple 
zvol's?  Does your Guest VM really want multiple virtual disks for some reason?

Side note:  Assuming you *really* just want a single guest to occupy the whole 
disk and run as fast as possible...  If you want to snapshot your guest, you 
should make the whole disk one zpool, and then carve out a zvol which is 
significantly smaller than 50%, say perhaps 40% or 45% might do the trick.

... or use sparse zvols, e.g. zfs create -V 10G -s tank/vol1

Of course, that's assuming you KNOW that you never max-out storage use
on that zvol. If you don't have control over that, then using smaller
zvol size is indeed preferable.
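
(For comparison - pool/volume names are just examples - the sparse variant
skips the refreservation that normally pins the full volsize:

   # zfs create -V 10G tank/vol-thick         # names are examples only
   # zfs create -V 10G -s tank/vol-sparse
   # zfs get refreservation tank/vol-thick tank/vol-sparse
)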



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread Nathan Kroenert

 Hi John,

Actually, last time I tried the whole AF (4k) thing, its performance 
was worse than woeful.


But admittedly, that was a little while ago.

The drives were the seagate green barracuda IIRC, and performance for 
just about everything was 20MB/s per spindle or worse, when it should 
have been closer to 100MB/s when streaming. Things were worse still when 
doing random...


I'm actually looking to put in something larger than the 3*2TB drives 
(triple mirror for read perf) this pool has in it - preferably 3 * 4TB 
drives. (I don't want to put in more spindles - just replace the current 
ones...)


I might just have to bite the bullet and try something with current SW. :).

Nathan.


On 05/29/12 08:54 PM, John Martin wrote:

On 05/28/12 08:48, Nathan Kroenert wrote:


Looking to get some larger drives for one of my boxes. It runs
exclusively ZFS and has been using Seagate 2TB units up until now (which
are 512 byte sector).

Anyone offer up suggestions of either 3 or preferably 4TB drives that
actually work well with ZFS out of the box? (And not perform like
rubbish)...

I'm using Oracle Solaris 11, and would prefer not to have to use a
hacked up zpool to create something with ashift=12.


Are you replacing a failed drive or creating a new pool?

I had a drive in a mirrored pool recently fail.  Both
drives were 1TB Seagate ST310005N1A1AS-RK with 512 byte sectors.
All the 1TB Seagate boxed drives I could find with the same
part number on the box (with factory seals in place)
were really ST1000DM003-9YN1 with 512e/4096p.  Just being
cautious, I ended up migrating the pools over to a pair
of the new drives.  The pools were created with ashift=12
automatically:

  $ zdb -C | grep ashift
  ashift: 12
  ashift: 12
  ashift: 12

Resilvering the three pools concurrently went fairly quickly:

  $ zpool status
    scan: resilvered 223G in 2h14m with 0 errors on Tue May 22 21:02:32 2012
    scan: resilvered 145G in 4h13m with 0 errors on Tue May 22 23:02:38 2012
    scan: resilvered 153G in 3h44m with 0 errors on Tue May 22 22:30:51 2012


What performance problem do you expect?


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread nathan

On 29/05/2012 11:10 PM, Jim Klimov wrote:

2012-05-29 16:35, Nathan Kroenert wrote:

Hi John,

Actually, last time I tried the whole AF (4k) thing, its performance
was worse than woeful.

But admittedly, that was a little while ago.

The drives were the seagate green barracuda IIRC, and performance for
just about everything was 20MB/s per spindle or worse, when it should
have been closer to 100MB/s when streaming. Things were worse still when
doing random...


On one hand, it is possible that being green, the drives aren't very
capable of fast IO - they had different design goals and tradeoffs.


Indeed! I just wasn't expecting it to be so profound.

But actually I was going to ask if you paid attention to partitioning?
At what offsets did your ZFS pool data start? Was that offset divisible
by 4KB (i.e. 256 512-byte sectors as is the default now, vs 34 sectors of
the older default)?

It was. Actually I tried it in a variety of ways, including auto EFI 
partition (zpool create with the whole disk), using an SMI label, and 
trying a variety of tricks with offsets. Again, it was a while ago - 
before the time of the SD RMW fix...
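
(Checking that is simple enough - the partition's first sector just needs
to be divisible by 8 (8 x 512B = 4KB). Something like the following,
looking at the First Sector column - the device name is only an example:

   # prtvtoc /dev/rdsk/c2d0s2                 # substitute your own device
)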


If the drive had 4kb native sectors but the logical FS blocks were
not aligned with that, then every write IO would involve RMW of
many sectors (perhaps disk's caching might alleviate this for
streaming writes though).

Yep - that's what it *felt* like, and I didn't seem to be able to change 
that at the time.


Also note that ZFS IO often is random even for reads, since you
have to read metadata and file data often from different dispersed
locations. Again, OS caching helps statistically, when you have
much RAM dedicated to caching. Hmmm... did you use dedup in those
tests? That is another source of performance degradation on smaller
machines (under tens of GBs of RAM).


At the time, I had 1TB of data, and 1TB of space... I'd expect that most 
of the data would have been written 'closeish' to sequential on disk, 
though I'll confess I only spent a short time looking at the 'physical' 
read/write locations being sent down through the stack. (Where the drive 
actually writes them - well, that's different. ;)


I have been contacted off list by a few folks that have indicated 
success with current drives and current Solaris bits. I'm thinking that 
it might be time to take another run at it.


I'll let the list know the results. ;)

Cheers

Nathan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-28 Thread Nathan Kroenert

 Hi folks,

Looking to get some larger drives for one of my boxes. It runs 
exclusively ZFS and has been using Seagate 2TB units up until now (which 
are 512 byte sector).


Anyone offer up suggestions of either 3 or preferably 4TB drives that 
actually work well with ZFS out of the box? (And not perform like 
rubbish)...


I'm using Oracle Solaris 11, and would prefer not to have to use a 
hacked up zpool to create something with ashift=12.


Thoughts on the best drives - or is Solaris 11 actually ready to go with 
whatever I throw at it? :)


And - am I doomed to have to use these so-called 'advanced format' 
drives (which as far as I can tell are in no way actually advanced, and 
only benefit HDD makers and not the end user)?


Cheers!

Nathan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Convert pool from ashift=12 to ashift=9

2012-03-20 Thread Nathan Kroenert

 Jim Klimov wrote:

 It is hard enough already to justify to an average wife that...snip

That made my night. Thanks, Jim. :)



On 03/20/12 10:29 PM, Jim Klimov wrote:

2012-03-18 23:47, Richard Elling wrote:


...
Yes, it is wrong to think that.


Ok, thanks, we won't try that :)


copy out, copy in. Whether this is easy or not depends on how well
you plan your storage use ...


Home users and personal budgets do tend to have a problem
with planning. Any mistake is to be paid for personally,
and many are left as is. It is hard enough already
to justify to an average wife that a storage box with
large X-Tb disks needs raidz3 or mirroring and thus
becomes larger and noisier, not to mention almost a
thousand bucks more expensive just for the redundancy
disks, but it will become two times cheaper in a year.

Yup, it is not very easy to find another 10+Tb backup
storage (with ZFS reliability) in a typical home I know
of. Planning is not easy...

But that's a rant... Hoping that in-place BP Rewrite
would arrive and magically solve many problems =)





Questions are:
1) How bad would a performance hit be with 512b blocks used
on a 4kb drive with such efficient emulation?


Depends almost exclusively on the workload and hardware. In my
experience, most folks who bite the 4KB bullet have low-cost HDDs
where one cannot reasonably expect high performance.


Is it
possible to model/emulate the situation somehow in advance
to see if it's worth that change at all?


It will be far more cost effective to just make the change.



Meaning altogether? That with consumer disks, which will suck
from a performance standpoint anyway, it was not a good idea
to use ashift=12, and it was more cost-effective to remain
at ashift=9 to begin with?

What about real-people's tests which seemed to show that
there were substantial performance hits with misaligned
large-block writes (spanning several 4k sectors at wrong
boundaries)?



I had an RFE posted sometime last year about making an
optimisation for both worlds: use formal ashift=9 and allow
writing of small blocks, but align larger blocks at set
boundaries (i.e. offset divisible by 4096 for blocks sized
4096+). Perhaps writing of 512b blocks near each other
should only be reserved for metadata which is dittoed
anyway, so that a whole-sector (4kb) corruption won't
be irreversible for some data. In effect, minblocksize
for userdata would be enforced (by config) at the same
4kb in such case.

This is a zfs-write only change (and some custom pool
or dataset attributes), so the on-disk format and
compatibility should not suffer with this solution.
But I had little feedback whether the idea was at
all reasonable.





2) Is it possible to easily estimate the amount of wasted
disk space in slack areas of the currently active ZFS
allocation (unused portions of 4kb blocks that might
become available if the disks were reused with ashift=9)?


Detailed space use is available from the zfs_blkstats mdb macro
as previously described in such threads.
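
(If I remember the incantation correctly, it is something along these
lines - grab the pool's spa address first, and double-check with
::help zfs_blkstats, since I'm going from memory:

   # echo ::spa | mdb -k                      # note the spa address
   # echo '<spa-address>::zfs_blkstats' | mdb -k
)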


3) How many parts of ZFS pool are actually affected by the
ashift setting?


Everything is impacted. But that isn't a useful answer.


From what I gather, it is applied at the top-level vdev
level (I read that one can mix ashift=9 and ashift=12
TLVDEVs in one pool spanning several TLVDEVs). Is that
a correct impression?


Yes


If yes, how does ashift size influence the number of
slots in the uberblock ring (128 vs. 32 entries), which is
applied at the leaf vdev level (right?) but should be
consistent across the pool?


It should be consistent across the top-level vdev.

There is 128KB of space available for the uberblock list. The minimum
size of an uberblock entry is 1KB, which gives 128 entries with 512-byte
sectors. Obviously, a 4KB disk can't write only 1KB, so for 4KB sectors
each entry occupies 4KB and there are 32 entries in the uberblock list.


So if I have ashift=12 and ashift=9 top-level devices
mixed in the pool, it is okay that some of them would
remember 4x more of pool's TXG history than others?





As far as I see in ZFS on-disk format, all sizes and
offsets are in either bytes or 512b blocks, and the
ashift'ed block size is not actually used anywhere
except to set the minimal block size and its implicit
alignment during writes.


The on-disk format doc is somewhat dated and unclear here. UTSL.


Are there any updates, or the 2006 pdf is the latest available?
For example, is there an effort in illumos/nexenta/openindiana
to publish their version of the current on-disk format? ;)

Thanks for all the answers,
//Jim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bad performance (Seagate drive related?)

2012-02-04 Thread Nathan Kroenert

 Hey there,

Few things:
 - Using /dev/zero is not necessarily a great test. I typically use 
/dev/urandom to create an initial block-o-stuff - something like a gig 
or so worth, in /tmp - then use dd to push that to my zpool (example 
after this list). (/dev/zero will return dramatically different results 
depending on pool/dataset settings for compression etc.)
 - Indeed - getting a total aggregate of 180MB/s seems pretty low on 
the face of it for the setup you have. What's the controller you are 
using? Any details on the driver, backplane, expander, array or other 
you might be using?
 - Have you tried your dd on individual spindles? You might find that 
they behave differently
 - Does your controller have DRAM on it? Can you put it in passthrough 
mode rather than cache?
 - I have done some testing trying to find odd behaviour like this 
before, and found on different occasions a number of different things:

- Drives: Things like the WD 'green' drives getting in my way
    - Alignment for non-EFI labeled disks (hm - maybe even on EFI... 
that one was a while ago) (particularly for 4K 'advanced format' (ha!) 
disks)
    - The controller was unable to keep up. (In one case, I ended up 
tossing an HP P400 (IIRC) and using the on-motherboard chipset, as it was 
considerably faster when running four disks.)
- Disks with wildly different performance characteristics were also 
bad (eg: Enterprise SATA mixed with 5400 RPM disks. ;)
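
For the /dev/urandom test mentioned in the first point, something along
these lines (the dataset path is just a placeholder):

   # dd if=/dev/urandom of=/tmp/randomdata bs=1024k count=1024
   # dd if=/tmp/randomdata of=/tank/ddtest bs=1024k    # /tank is a placeholder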


I'd suggest that you spend a little time validating the basic 
assumptions around:

 - speed of individual disks,
 - speed of individual buses
 - Whether you are being limited by CPU (ie: If you have compression or 
dedupe turned on) (view with mpstat and friends)
 - I'll also note that you are looking close to the number of IOPS I'd 
expect a consumer disk to supply assuming a somewhat random distribution 
of IOPS.
 - Consider that your 180MB/s is actually 360 (well - not quite - but 
it's a lot more than 180). Remember - in a mirror, you literally need to 
write the data twice.

  8.0 3857.8 64.0 337868.8 0.0 64.5 0.0 16.7 0 704 c5
  (Note above is your c5 controller - running at around 337 MB/s)

Incidentally - this seems awfully close to 3Gb/s... How did you say all 
of your external drives were attached? If I didn't know better, I'd be 
asking serious questions about how many lanes of a SAS connection the 
SATA-attached drives were able to use... Actually - I don't know better, 
so I'd ask anyway... ;)


I think this will likely go a long way toward helping understand where 
the holdup is.


There is also a heap of great stuff on solarisinternals.com which I'd 
highly recommend taking a look at after you have validated the basics...


Were this one of my systems (and especially if it's new, and you don't 
love your data and can re-create the pool), I'd be tempted to do 
something like a very destructive...


DISKS="c5t...d0 c5t...d0"    # substitute your own disk device names
for i in $DISKS
do
    dd if=/tmp/randomdata.file.I.created.earlier of=/dev/rdsk/${i} bs=1024k
done

and see how much you can stuff down the pipe.

Remember - this will kill whatever is on the disks, so do think twice 
before you do it. ;)


If you can't get at least 80-100MB/s on the outside of the platter, I'd 
suggest you should be looking at layers below ZFS. If you *can*, then 
you start looking further up the stack.
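
For a quick raw-read sanity check of a single spindle, something like the
following (substitute your own device path - use the whole-disk node,
p0 or s2 depending on platform):

   # dd if=/dev/rdsk/c5t50014EE0ACE4AEEFd0p0 of=/dev/null bs=1024k count=1024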


Hope this helps somewhat. Let us know how you go.

Cheers!

Nathan.

On 02/ 1/12 04:52 AM, Mohammed Naser wrote:

Hi list!

I have seen less-than-stellar ZFS performance on a setup of one main
head connected to a JBOD (using SAS, but the drives are SATA).  There are
16 drives (8 mirrors) in this pool but I'm getting 180ish MB/s
sequential writes (using dd, I know it's not precise, but those
numbers should be higher).

With some help on IRC, it seems that part of the reason I'm slowing
down is that some drives seem to be slower than the others.  Initially, I
had some drives running in 1.5Gb/s mode instead of 3.0Gb/s -- they are all
running at 3.0Gb/s now.  While running the following dd command, the
output of iostat reflects a much higher %b, which seems to say that
those drives are slower (but could they really be slowing down
everything else that much? --- Or am I looking at the wrong spot
here?) -- The pool configuration is also included below

dd if=/dev/zero of=4g bs=1M count=4000

                     extended device statistics
     r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
     1.0    0.0     8.0      0.0  0.0  0.0    0.0    0.2   0   0 c1
     1.0    0.0     8.0      0.0  0.0  0.0    0.0    0.2   0   0 c1t2d0
     8.0 3857.8    64.0 337868.8  0.0 64.5    0.0   16.7   0 704 c5
     0.0  259.0     0.0  26386.2  0.0  3.6    0.0   14.0   0  37 c5t50014EE0ACE4AEEFd0
     1.0  266.0     8.0  27139.2  0.0  3.6    0.0   13.5   0  37 c5t50014EE056EB0356d0
     2.0  276.0    16.0  19315.1  0.0  3.7    0.0   13.3   0  40 c5t50014EE00239C976d0
     0.0  279.0     0.0  19699.0  0.0  3.6    0.0   13.0   0  37 c5t50014EE0577C459Cd0
     1.0  232.0     8.0  23061.9  0.0  3.6    0.0   15.4   0

Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-18 Thread Nathan Kroenert
 Do note that, though Frank is correct, you have to be a little careful 
about what might happen should you drop your original disk and only 
the large mirror half is left... ;)


On 12/16/11 07:09 PM, Frank Cusack wrote:
You can just do fdisk to create a single large partition.  The 
attached mirror doesn't have to be the same size as the first component.


On Thu, Dec 15, 2011 at 11:27 PM, Gregg Wonderly gregg...@gmail.com 
mailto:gregg...@gmail.com wrote:


Cindy, will it ever be possible to just have attach mirror the
surfaces, including the partition tables?  I spent an hour today
trying to get a new mirror on my root pool.  There was a 250GB
disk that failed.  I only had a 1.5TB handy as a replacement.
 prtvtoc ... | fmthard does not work in this case and so you have
to do the partitioning by hand, which is just silly to fight with
anyway.

Gregg

Sent from my iPhone

On Dec 15, 2011, at 6:13 PM, Tim Cook t...@cook.ms
mailto:t...@cook.ms wrote:


Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, Cindy Swearingen
cindy.swearin...@oracle.com
mailto:cindy.swearin...@oracle.com wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label
and a slice 0.

The syntax to attach a disk to create a mirrored root pool
is like this, for example:

# zpool attach rpool c1t0d0s0 c1t1d0s0

Thanks,

Cindy
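
(On x86, the other usual step after the attach is putting grub on the
new disk, e.g. with the same device name as above:

   # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0
)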

On 12/15/11 16:20, Anonymous Remailer (austria) wrote:


On Solaris 10, if I install using ZFS root on only one drive, is there
a way to add another drive as a mirror later? Sorry if this was
discussed already. I searched the archives and couldn't find the
answer. Thank you.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
mailto:zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org mailto:zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!

2011-12-18 Thread Nathan Kroenert
 I know some others may already have pointed this out - but I can't see 
it and not say something...


Do you realise that losing a single disk in that pool could pretty much 
render the whole thing busted?


At least for me - at the rate at which _I_ seem to lose disks - it would 
be worth considering something different ;)


Cheers!

Nathan.

On 12/19/11 09:05 AM, Jan-Aage Frydenbø-Bruvoll wrote:

Hi,

On Sun, Dec 18, 2011 at 22:00, Fajar A. Nugrahaw...@fajar.net  wrote:

 From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
(or at least Google's cache of it, since it seems to be inaccessible
now:


Keep pool space under 80% utilization to maintain pool performance.
Currently, pool performance can degrade when a pool is very full and
file systems are updated frequently, such as on a busy mail server.
Full pools might cause a performance penalty, but no other issues. If
the primary workload is immutable files (write once, never remove),
then you can keep a pool in the 95-96% utilization range. Keep in mind
that even with mostly static content in the 95-96% range, write, read,
and resilvering performance might suffer.


I'm guessing that your nearly-full disk, combined with your usage
pattern, is the cause of the slowdown. Try freeing up some space
(e.g. make it about 75% full), just to be sure.

I'm aware of the guidelines you refer to, and I have had slowdowns
before due to the pool being too full, but that was in the 9x% range
and the slowdown was in the order of a few percent.

At the moment I am slightly above the recommended limit, and the
performance is currently between 1/1000 and 1/2000 of what the other
pools achieve - i.e. a few hundred kB/s versus 2GB/s on the other
pools - surely allocation above 80% cannot carry such extreme
penalties?!

For the record - the read/write load on the pool is almost exclusively WORM.

Best regards
Jan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-11 Thread Nathan Kroenert

 On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:

On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware.

The only vendor I know that can do this is NetApp

And you really work at Oracle?:)

The answer is definitely yes. ARC caches on-disk blocks and dedup just
references those blocks. When you read, dedup code is not involved at all.
Let me show it to you with a simple test:

Create a file (dedup is on):

# dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

# dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

# zpool export foo
# zpool import foo

Now let's read one file:

# dd if=/foo/a of=/dev/null bs=1m
1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file
shares all the same blocks, so if ARC caches blocks only once, reading
'b' should be much faster:

# dd if=/foo/b of=/dev/null bs=1m
1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
activity. Magic?:)



Hey all,

That reminds me of something I have been wondering about... Why only 12x 
faster? If we are effectively reading from memory - as compared to a 
disk reading at approximately 100MB/s (which is about an average PC HDD 
reading sequentially), I'd have thought it should be a lot faster than 12x.


Can we really only pull stuff from cache at a little over one 
gigabyte per second if it's dedup data?


Cheers!

Nathan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement disks for Sun X4500

2011-07-07 Thread nathan

On 7/07/2011 3:12 PM, X4 User wrote:

I am bumping this thread because I too have the same question ... can I put 
modern 3TB disks (Hitachi Deskstars) into an old X4500?

If not, would the X4540 accept them?


I'd expect they would *work* but whether it would be a good idea or not 
could be debated and could indeed be a variable answer.


My recollection of those boxes is that they had to use enterprise SATA 
disks in them (as opposed to consumer SATA) to deal with the internal 
vibration that resulted from having so many disks so densely packed. I 
recall being involved in a performance issue or two that was related to 
certain disks in the box going slow as a result of said vibrations.


For an amusing example of how vibrations can impact disks, google 
Brendan Gregg shouting in the datacentre.


So - assuming the disks actually work on the controllers (and I don't 
recall there being any specific limitations), and you keep an eye on the 
disk response times (so you are aware of any disks not performing 
optimally), you should be OK.
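
(Something as simple as the following, keeping an eye on asvc_t, is
usually enough to spot a disk that's dragging:

   # iostat -xnz 5
)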


Of course - not in a production sense - as doing that without support 
would just be asking for trouble. ;)


Oh - and as a final point - if you are planning to run Solaris on this 
box, make sure they are not the 4KB sector disks, as at least in my 
experience, their performance with ZFS is profoundly bad. Particularly 
with all the metadata update stuff...


4KB sectors almost seem to me to be a stunt by HDD manufacturers to be 
able to claim more available space for the same device, and to be lazy 
in the CRC generation/checking arena. And to profoundly impact the time 
it takes to read or update anything less than 4K. But - then again, 
maybe I'm missing something.


Cheers!

Nathan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool import hangs - pls help

2011-06-14 Thread Nathan Kroenert

 Hi Max,

Unhelpful questions about your CPU aside, what else is your box doing?

Can you run up a second or third shell (ssh or whatever) and watch if 
the disks / system are doing any work?


Were it Solaris, I'd run:
iostat -x
prstat -a
vmstat
mpstat (Though as discussed, you have only a single core CPU)
echo ::memstat | mdb -k  (No idea how you might do that in BSD)

Some other things to think about:
 - Have you tried removing the extra memory? I have indeed seen crappy 
PC hardware where more than 3GB caused some really bad behaviour 
in Solaris.
 - Have you tried booting into a current Solaris (from CD) and seeing 
if it can import the pool? (Don't upgrade - just import) ;)



I'm aware that there were some long import issues discussed on the list 
recently - someone had an import take some 12 hours or more - would be 
worth looking over the last few weeks posts.


Also - getting a truss or pstack (if FreeBSD has that?) of the process 
trying to initiate the import might help some of the more serious folks 
on the list to see where it's getting stuck. (Or if indeed, it's 
actually getting stuck, and not simply catastrophically slow.)
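
(On the Solaris side that would be something like the following, against
the PID of the hung zpool command:

   # pgrep -fl 'zpool import'
   # pstack <pid>                             # <pid> from the pgrep output
   # truss -p <pid>
)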


Hope this helps at least a little.

Cheers,

Nathan.

On 06/14/11 03:20 PM, Maximilian Sarte wrote:

Hi,
   I am posting here in a tad of desperation. FYI, I am running FreeNAS 8.0.
Anyhow, I created a raidz1 (tank1) with 4 x 2Tb WD EARS hdds.
All was doing OK until I decided to up the RAM to 4 GB, since that is what was 
recommended. As soon as I re-started data migration, ZFS issued messages 
indicating that the pool was unavailable and froze the system.
After reboot (FN is based on FreeBSD) and re-installing FN (it did not want to 
complete booting - probably corruption on the USB stick it was running from), 
tank1 was unavailable.
Status indicates that there are no pools, as does List.
Import indicates that tank1 is OK, all 4 hdds are ONLINE, and their status 
seems OK.
When I try any of:

zpool import tank1
zpool import -f tank1
zpool import -fF tank1

the commands simply hang forever (FreeNAS seems OK).

Any suggestions would be immensely appreciated.
Tx!
Tx!


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool scrub on b123

2011-04-16 Thread Nathan Kroenert

 Hi Karl,

Is there any chance at all that some other system is writing to the 
drives in this pool? You say other things are writing to the same JBOD...


Given that the amount flagged as corrupt is so small, I'd imagine not, 
but thought I'd ask the question anyways.


Cheers!

Nathan.

On 04/16/11 04:52 AM, Karl Rossing wrote:

Hi,

One of our zfs volumes seems to be having some errors. So I ran zpool 
scrub and it's currently showing the following.


-bash-3.2$ pfexec /usr/sbin/zpool status -x
  pool: vdipool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress for 3h10m, 13.53% done, 20h16m to go
config:

        NAME         STATE     READ WRITE CKSUM
        vdipool      ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c9t14d0  ONLINE       0     0    12  6K repaired
            c9t15d0  ONLINE       0     0    13  167K repaired
            c9t16d0  ONLINE       0     0    11  5.50K repaired
            c9t17d0  ONLINE       0     0    20  10K repaired
            c9t18d0  ONLINE       0     0    15  7.50K repaired
        spares
          c9t19d0    AVAIL

errors: No known data errors


I have another server connected to the same jbod using drives c8t1d0 
to c8t13d0 and it doesn't seem to have any errors.


I'm wondering how it could have gotten so screwed up?

Karl





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134

2011-03-08 Thread Nathan Kroenert

Ed -

Simple test. Get onto a system where you *can* disable the disk cache, 
disable it, and watch the carnage.


Until you do that, you can pose as many interesting theories as you like.

Bottom line is that 75 IOPS per spindle won't impress many people, 
and that's the sort of rate you get when you disable the disk cache.
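
(From memory, toggling the on-disk write cache for a test is done from
format's expert mode - menu names quoted from memory, so double check on
your system:

   # format -e
   format> cache
   cache> write_cache
   write_cache> display
   write_cache> disable
)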


Nathan.

On 8/03/2011 11:53 PM, Edward Ned Harvey wrote:

From: Jim Dunham [mailto:james.dun...@oracle.com]

ZFS only uses system RAM for read caching,

If your email address didn't say oracle, I'd just simply come out and say
you're crazy, but I'm trying to keep an open mind here...  Correct me where
the following statement is wrong:  ZFS uses system RAM to buffer async
writes.

Sync writes must hit the ZIL first, and then the sync writes are put into
the write buffer along with all the async writes to be written to the main
pool storage.  So after sync writes hit the ZIL and the device write cache
is flushed, they too are buffered in system RAM.



as all writes must be written to
some form of stable storage before acknowledged. If a vdev represents a
whole disk, ZFS will attempt to enable write caching. If a device does not
support write caching, the attempt to set wce fails silently.

Here is an easy analogy to remember basically what you said:  format -e
can control the cache settings for c0t0d0, but cannot control the cache
settings for c0t0d0s0 because s0 is not actually a device.

I contend:

Suppose you have a disk with on-disk write cache enabled.  Suppose a sync
write comes along, so ZFS first performs a sync write to some ZIL sectors.
Then ZFS will issue the cache flush command and wait for it to complete
before acknowledging the sync write; hence the disk write cache does not
benefit sync writes.  So then we start thinking about async writes, and
conclude:  The async writes were acknowledged long ago, when the async
writes were buffered in ZFS system ram, so there is once again, no benefit
from the disk write cache in either situation.

That's my argument, unless somebody can tell me where my logic is wrong.
Disk write cache offers zero benefit.  And disk read cache only offers
benefit in unusual cases that I would call esoteric.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How long should an empty destroy take? snv_134

2011-03-06 Thread Nathan Kroenert
Why wouldn't they try a reboot -d? That would at least get some data in 
the form of a crash dump if at all possible...


A power cycle seems a little medieval to me... At least in the first 
instance.


The other thing I have noted is that sometimes things do get wedged, and 
if you can find where (mdb -k and take a poke at the stack of some of 
the zfs/zpool commands that are hung to see what they were operating on), 
a zpool clear on that zpool can sometimes get things moving.  Not that 
I'm recommending that you should *need* to, but that has got me unwedged 
on occasion. (Though usually when I have done something administratively 
silly... ;)
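
(The poking-at-stacks bit is something along these lines, using the PID
of the hung zfs/zpool command - from memory, so check ::help first:

   # echo "0t<pid>::pid2proc | ::walk thread | ::findstack -v" | mdb -k
)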


Nathan.

 On 7/03/2011 12:14 PM, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Yaverot

We're heading into the 3rd hour of the zpool destroy on others.
The system isn't locked up, as it responds to local keyboard input, and

I bet you, you're in a semi-crashed state right now, which will degrade into
a full system crash.  You'll have no choice but to power cycle.  Prove me
wrong, please.   ;-)

I bet, as soon as you type in any zpool or zfs command ... even list
or status they will also hang indefinitely.

Is your pool still 100% full?  That's probably the cause.  I suggest if
possible, immediately deleting something and destroying an old snapshot to
free up a little bit of space.  And then you can move onward...



While this destroy is running all other zpool/zfs commands appear to be
hung.

Oh, sorry, didn't see this before I wrote what I wrote above.  This just
further confirms what I said above.



zpool destroy on an empty pool should be on the order of seconds, right?

zpool destroy is instant, regardless of how much data there is in a pool.
zfs destroy is instant for an empty volume, but zfs destroy takes a long
time for a lot of data.

But as mentioned above, that's irrelevant to your situation.  Because your
system is crashed, and even if you try init 0 or init 6...  They will fail.
You have no choice but to power cycle.

For the heck of it, I suggest init 0 first.  Then wait half an hour, and
power cycle.  Just to try and make the crash as graceful as possible.

As soon as it comes back up, free up a little bit of space, so you can avoid
a repeat.



Yes, I've triple checked, I'm not destroying tank.
While writing the email, I attempted a new ssh connection, it got to the
Last login: line, but hasn't made it to the prompt.

Oh, sorry, yet again this is confirming what I said above.  semi-crashed and
degrading into a full crash.
Right now, you cannot open any new command prompts.
Soon it will stop responding to ping.  (Maybe 2-12 hours.)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sorry everyone was: Re: External SATA drive enclosures + ZFS?

2011-02-26 Thread Nathan Kroenert
 Actually, I find that tremendously encouraging. Lots of internal 
Oracle folks still subscribed to the list!


Much better than none... ;)

Nathan.

On 02/26/11 03:29 PM, Yaverot wrote:

Sorry all, didn't realize that half of Oracle would auto-reply to a public 
mailing list since they're out of the office 9:30 Friday nights.  I'll try to 
make my initial post each month during daylight hours in the future.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SIL3114 and sparc solaris 10

2011-02-25 Thread Nathan Kroenert
 I can confirm that on *at least* 4 different cards - from different 
board OEMs - I have seen single bit ZFS checksum errors that went away 
immediately after removing the 3114 based card.


I stepped up to the 3124 (PCI-X, up to 133MHz) and 3132 (PCIe) and have 
never looked back.


I now throw any 3114 card I find into the bin at the first available 
opportunity as they are a pile of doom waiting to insert an exploding 
garden gnome into the unsuspecting chest cavity of your data.


I'd also add that I have never made an effort to determine if it was 
actually the Solaris driver that was at fault - but being that the other 
two cards I have mentioned are available for about $20 a pop, it's not 
worth my time.


I don't recall if Solaris 10 (SPARC or x86) actually has the si3124 
driver, but if it does, for a cheap thrill, they are worth a bash. I 
have no problems pushing 4 disks pretty much flat out on a PCI-X 133 
3124-based card. (Note that there was a PCI and a PCI-X version of the 
3124, so watch out.)


Cheers!

Nathan.

On 02/24/11 02:10 AM, Andrew Gabriel wrote:

Krunal Desai wrote:
On Wed, Feb 23, 2011 at 8:38 AM, Mauricio Tavares 
raubvo...@gmail.com wrote:

   I see what you mean; in
http://mail.opensolaris.org/pipermail/opensolaris-discuss/2008-September/043024.html 


they claim it is supported by the uata driver. What would you suggest
instead? Also, since I have the card already, how about if I try it 
out?


My experience with SPARC is limited, but perhaps the Option ROM/BIOS
for that card is intended for x86, and not SPARC? I might thinking of
another controller, but this could be the case. You could always try
to boot with the card; the worst that'll probably happen is boot hangs
before the OS even comes into play.


SPARC won't try to run the BIOS on the card anyway (it will only run 
OpenFirmware BIOS), but you will have to make sure the card has the 
non-RAID BIOS so that the PCI class doesn't claim it to be a RAID 
controller, which will prevent Solaris going anywhere near the card at 
all. These cards could be bought with either RAID or non-RAID BIOS, 
but RAID was more common. You can (or could some time back) download 
the RAID and non-RAID BIOS from Silicon Image and re-flash which also 
updates the PCI class, and I think you'll need a Windows system to 
actually flash the BIOS.


You might want to do a google search on 3114 data corruption too, 
although it never hit me back when I used the cards.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] External SATA drive enclosures + ZFS?

2011-02-25 Thread Nathan Kroenert
 I'm with the gang on this one as far as USB being the spawn of the 
devil for mass storage you want to depend on. I'd rather scoop my eyes 
out with a red hot spoon than depend on permanently attached USB 
storage... And - don't even start me on SPARC and USB storage... It's 
like watching pitch flow... (see 
http://en.wikipedia.org/wiki/Pitch_drop_experiment). I never spent too 
much time working out why - but I never seemed to get better than about 
10MB/s with SPARC+USB...


When it comes to cheap... I use cheap external SATA/USB combo enclosures 
(single drive ones) as I like the flexibility of being able to use them 
in eSATA mode nice and fast (and reliable considering the $$) or in USB 
mode should I need to split a mirror off and read it on my laptop, which 
has no esata port...


Also - using the single drive enclosures is by far the cheapest (at 
least here in Oz), and you get redundant power supplies, as they use 
their own mini brick AC/DC units. I'm currently very happy using 2TB 
disks in the external eSATA+USB thingies.


I had been using ASTONE external eSATA/USB units - though it seems my 
local shop has stopped carrying them... I liked them as they had 
perforated side panels, which allow the disk to stay much cooler than 
some of my other enclosures... (And have a better 'vertical' stand if 
you want the disks to stand up, rather than lie on their side.)


If your box has PCIe slots, grab one or two $20 Silicon Image 3132 
controllers with eSATA ports and you should be golden... You will then 
be able to run between 2 and 4 disks - easily pushing them to their 
maximum platter speed - which for most of the 2TB disks is near enough 
to 100MB/s at the outer edges. You will also get considerably higher IOPS 
- particularly when they are sequential - using eSATA.


Note: All of this is with the 'cheap' view... You can most certainly buy 
much better hardware... But bang for buck - I have been happy with the 
above.


Cheers!

Nathan.

On 02/26/11 01:58 PM, Brandon High wrote:

On Fri, Feb 25, 2011 at 4:34 PM, Rich Teerrich.t...@rite-group.com  wrote:

Space is starting to get a bit tight here, so I'm looking at adding
a couple of TB to my home server.  I'm considering external USB or
FireWire attached drive enclosures.  Cost is a real issue, but I also

I would avoid USB, since it can be less reliable than other connection
methods. That's the impression I get from older posts made by Sun
devs, at least. I'm not sure how well Firewire 400 is supported, let
alone Firewire 800.

You might want to consider eSATA. Port multipliers are supported in
recent builds (128+ I think), and will give better performance than
USB. I'm not sure if PMPs are supported on SPARC though, since it
requires support in both the controller and the PMP.

Consider enclosures from other manufacturers as well. I've heard good
things about Sans Digital, but I've never used them. The 2-drive
enclosure has the same components as the item you linked but 1/2 the
cost via Newegg.


The intent would be put two 1TB or 2TB drives in the enclosure and use
ZFS to create a mirrored pool out of them.  Assuming this enclosure is
set to JBOD mode, would I be able to use this with ZFS?  The enclosure

Yes, but I think the enclosure has a SiI5744 inside it, so you'll
still have one connection from the computer to the enclosure. If that
goes, you'll lose both drives. If you're just using two drives, two
separate enclosures on separate buses may be better. Look at
http://www.sansdigital.com/towerstor/ts1ut.html for instance. There
are also larger enclosures with up to 8 drives.


I can't think of a reason why it wouldn't work, but I also have exactly
zero experience with this kind of set up!

Like I mentioned, USB is prone to some flakiness.


Assuming this would work, given that I can't see to find a 4-drive
version of it, would I be correct in thinking that I could buy two of

You might be better off using separate enclosures for reliability.
Make sure to split the mirrors across the two devices. Use separate
USB controllers if possible, so a bus reset doesn't affect both sides.


Assuming my proposed enclosure would work, and assuming the use of
reasonable quality 7200 RPM disks, how would you expect the performance
to compare with the differential UltraSCSI set up I'm currently using?
I think the DWIS is rated at either 20MB/sec or 40MB/sec, so on the
surface, the USB attached drives would seem to be MUCH faster...

USB 2.0 is about 30-40MB/s under ideal conditions, but doesn't support
any of the command queuing that SCSI does. I'd expect performance to
be slightly lower, and to use slightly more CPU. Most USB controllers
don't support DMA, so all I/O requires CPU time.

What about an inexpensive SAS card (eg: Supermicro AOC-USAS-L4i) and
external SAS enclosure (eg: Sans Digital TowerRAID TR4X). It would
cost about $350 for the setup.

-B



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-14 Thread Nathan Kroenert

Thanks for all the thoughts, Richard.

One thing that still sticks in my craw is that I'm not wanting to write 
intermittently. I'm wanting to write flat out, and those writes are 
being held up... Seems to me that zfs should know and do something about 
that without me needing to tune zfs_vdev_max_pending...
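
(For the record, the knob itself is easy enough to poke for testing - the
value below is just an example - or set it persistently in /etc/system:

   # echo zfs_vdev_max_pending/W0t4 | mdb -kw
   set zfs:zfs_vdev_max_pending = 4            # /etc/system equivalent
)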


Nonetheless, I'm now at a far more balanced point than when I started, 
so that's a good thing. :)


Cheers,

Nathan.

On 15/02/2011 6:44 AM, Richard Elling wrote:

Hi Nathan,
comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:


On 14/02/2011 4:31 AM, Richard Elling wrote:

On Feb 13, 2011, at 12:56 AM, Nathan Kroenertnat...@tuneunix.com   wrote:


Hi all,

Exec summary: I have a situation where I'm seeing lots of large reads starving 
writes from being able to get through to disk.

snip

What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks helps, too.

NexentaStor fans, note that you can do this easily, on the fly, via the 
Settings -> Preferences -> System web GUI.
   -- richard


Hi Richard,

Long time no speak! Anyhoo - See below.

I'm unconvinced that faster disks would help. I think faster disks, at least in 
what I'm observing, would make it suck just as bad, just reading faster... ;) 
Maybe I'm missing something.

Faster disks always help :-)


Queue depth is around 10 (default and unchanged since install), and average 
service time is about 25ms... Below are 1 second samples with iostat - while I 
have included only about 10 seconds, it's representative of what I'm seeing all 
the time.
                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        360.9   13.0 46190.5   351.4  0.0 10.0   26.7   1 100
sd7        342.9   12.0 43887.3   329.9  0.0 10.0   28.1   1 100

ok, we'll take sd6 as an example (the math is easy :-) ...
actv = 10
svc_t = 26.7

actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk,
but once it leaves ZFS, there is no way for ZFS to manage it. In the
case of the active queue, the I/Os have left the OS, so even the OS
is unable to change what is in the queue or directly influence when
the I/Os will be finished.

In ZFS, the queue has a priority scheduler and does place a higher
priority on async writes than async reads (since b130 or so). But what
you can see is that the intermittent nature of the async writes gets
stuck behind the 267 milliseconds as the queue drains the reads.
[no, I'm not sure if that makes sense, try again...]
If it sends reads continuously and writes occasionally, it will appear
that reads have much more domination. In older releases, when the
reads and writes had the same priority, this looks even worse.


                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        422.1    0.0 54025.0     0.0  0.0 10.0   23.6   1 100
sd7        422.1    0.0 54025.0     0.0  0.0 10.0   23.6   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        370.0   11.0 47360.4   342.0  0.0 10.0   26.2   1 100
sd7        327.0   16.0 41856.4   632.0  0.0  9.6   28.0   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        388.0    7.0 49406.4   290.0  0.0  9.8   24.8   1 100
sd7        409.0    1.0 52350.3     2.0  0.0  9.5   23.2   1  99

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        423.0    0.0 54148.6     0.0  0.0 10.0   23.6   1 100
sd7        413.0    0.0 52868.5     0.0  0.0 10.0   24.2   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        400.0    2.0 51081.2     2.0  0.0 10.0   24.8   1 100
sd7        384.0    4.0 49153.2     4.0  0.0 10.0   25.7   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        401.9    1.0 51448.9     8.0  0.0 10.0   24.8   1 100
sd7        424.9    0.0 54392.4     0.0  0.0 10.0   23.5   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        215.1  208.1 26751.9 25433.5  0.0  9.3   22.1   1 100
sd7        189.1  216.1 24199.1 26833.9  0.0  8.9   22.1   1  91

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        295.0  162.0 37756.8 20610.2  0.0 10.0   21.8   1 100
sd7        307.0  150.0 39292.6 19198.4  0.0 10.0   21.8   1 100

                    extended device statistics
device       r/s    w/s    kr/s    kw/s wait actv  svc_t  %w  %b
sd6        405.0    2.0 51843.8     6.0

[zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-13 Thread Nathan Kroenert
   on default
data  setuid                 on          default
data  readonly               off         default
data  zoned                  off         default
data  snapdir                hidden      default
data  aclinherit             restricted  default
data  canmount               on          default
data  xattr                  on          default
data  copies                 1           default
data  version                3           -
data  utf8only               off         -
data  normalization          none        -
data  casesensitivity        sensitive   -
data  vscan                  off         default
data  nbmand                 off         default
data  sharesmb               off         default
data  refquota               none        default
data  refreservation         none        local
data  primarycache           all         default
data  secondarycache         all         default
data  usedbysnapshots        12.2G       -
data  usedbydataset          500G        -
data  usedbychildren         864G        -
data  usedbyrefreservation   0           -
data  logbias                latency     default
data  dedup                  off         default
data  mlslabel               none        default
data  sync                   standard    default
data  encryption             off         -
data  keysource              none        default
data  keystatus              none        -
data  rekeydate              -           default
data  rstchown               on          default
data  com.sun:auto-snapshot  true        local


Obviously, the potential for performance issues is considerable - and 
should it be required, I can provide some other detail, but given that 
this is so easy to reproduce, I thought I'd get it out there, just in case.


It is also worthy of note that commands like 'zfs list' take anywhere 
from 20 to 40 seconds to run when I have that sort of workload running - 
which also seems less than optimal.


I tried to recreate this issue on the boot pool (rpool), which is a 
single 2.5" 7200rpm disk (to take the cache controller out of the 
configuration) - but this seemed to hard-hang the system (yep - even 
caps lock / num-lock were non-responsive) - and I did not have any 
watchdog/snooping set and ran out of steam myself, so just hit the big 
button.


When I get the chance, I'll give the rpool thing a crack again, but 
overall, it seems to me that the behavior I'm observing is not great...


I'm also happy to supply lockstats / dtrace output etc if it'll help.

Thoughts?

Cheers!

Nathan.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-13 Thread Nathan Kroenert

Hi Steve,

Thanks for the thoughts - I think that everything you asked about is in 
the original email - but for reference again, it's 151a (s11 express).


Are you really suggesting that, for a single-user system, I need 16GB of 
memory just to get ZFS to be able to write while it's reading? (And even 
then, that would be contingent on getting repeat, cached hits in the 
ARC.) That's hardly sensible, and anything but enterprise. I know I'm 
only talking about my little baby box at the moment, but extend that to 
a large database application, and I'm seeing badness all round.


Worse - If I'm reading a 45GB contiguous file (say, HD video), the only 
way an ARC will help me is if I have 64GB, and have read it in the 
past... especially if I'm reading it sequentially. That's 
inconceivable!! (cue reference to the Princess Bride :). I'd also add 
that for the most part, 8GB is plenty for ZFS, and there are a lot of 
Sun/Oracle customers using it now in LDOM environments where 8GB is just 
great in the control/IO domain.


I don't think trying to blame the system in this case is the right 
answer. ZFS schedules the read/write activities, and to me it seems that 
it's just not doing that.


I was suspicious of the impact the HP RAID controller is having - and 
how it might be reacting to what's being pushed at it - so I re-created 
exactly this problem again on a different system with native, non-cached 
SATA controllers. The issue is identical. (Though I have since determined 
that my HP RAID controller is actually *slowing* my reads and writes to 
disk! ;)


Cheers!

Nathan.




On 14/02/2011 4:08 AM, gon...@comcast.net wrote:

Hi Nathan,

Maybe  it is buried somewhere in your email, but I did not see what 
zfs version you are using.


This is rather important, because the 145+ kernels work a lot better 
in many ways than the early ones (say, 134-ish).

So whenever you are reporting various ZFS issues, something like 
`uname -a` to report the kernel rev is most useful.

Writes being starved by reads was a complaint with early ZFS; I certainly 
do not see any evidence of this in the 145+ kernels.

There is a fair amount of tuning and configuration that can be done
(adding SSDs to your pool, ZIL vs no ZIL, how caching is configured, 
i.e. what to cache...)

8 GB is not a lot of memory for ZFS; I would recommend double that.

If all goes well, most reads would be satisfied from the ARC, and not 
interfere with writes.



Steve


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS read/write fairness algorithm for single pool

2011-02-13 Thread Nathan Kroenert

On 14/02/2011 4:31 AM, Richard Elling wrote:

On Feb 13, 2011, at 12:56 AM, Nathan Kroenertnat...@tuneunix.com  wrote:


Hi all,

Exec summary: I have a situation where I'm seeing lots of large reads starving 
writes from being able to get through to disk.

snip

What is the average service time of each disk? Multiply that by the average
active queue depth. If that number is greater than, say, 100ms, then the ZFS
I/O scheduler is not able to be very effective because the disks are too slow.
Reducing the active queue depth can help, see zfs_vdev_max_pending in the
ZFS Evil Tuning Guide. Faster disks help, too.

NexentaStor fans, note that you can do this easily, on the fly, via the 
Settings -> Preferences -> System web GUI.
   -- richard



Hi Richard,

Long time no speak! Anyhoo - See below.

I'm unconvinced that faster disks would help. I think faster disks, at 
least in what I'm observing, would make it suck just as bad, just 
reading faster... ;) Maybe I'm missing something.


Queue depth is around 10 (default and unchanged since install), and 
average service time is about 25ms... Below are 1 second samples with 
iostat - while I have included only about 10 seconds, it's 
representative of what I'm seeing all the time.
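
(By my arithmetic that's roughly 10 outstanding x 25ms = ~250ms of queued 
latency per device - well over the 100ms rule of thumb you mention. The 
samples below are plain iostat -x output at 1 second intervals.)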

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
sd7        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
sd7        327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
sd7        409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
sd7        413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
sd7        384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
sd7        424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
sd7        189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
sd7        307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100

                 extended device statistics
device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
sd6        405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
sd7        408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100

Bottom line is that ZFS does not seem to care about getting my 
writes to disk when there is a heavy read workload.


I have also confirmed that it's not the RAID controller either - 
behaviour is identical with direct attach SATA.


But - to your excellent theory: Setting zfs_vdev_max_pending to 1 causes 
things to swing dramatically!
 - At 1, writes proceed much faster than reads - 20MB/s read per 
spindle : 35MB/s write per spindle
 - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s write
 - At 3, it's starting to lean more heavily to reads again, but writes 
at least get a whack - 35MB/s per spindle read : 15-20MB/s write
 - At 4, we are closer to 35-40MB/s read, 15MB/s write

By the time we get back to the default of 0xa, writes drop off almost 
completely.


The crossover (on the box with no RAID controller) seems to be 5. 
Anything more than that, and writes get shouldered out the way almost 
completely.


So - aside from the obvious - manually setting zfs_vdev_max_pending - do 
you have any thoughts on ZFS being able to make this sort of 
determination by itself? It would be somewhat of a shame to bust out 
such 'whacky knobs' for plain old direct attach SATA disks to get balance...
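
For reference, the knob I'm poking is the standard ZFS Evil Tuning Guide one - 
roughly like this, shown here only as a sketch of what I tested:

   # change it on the fly (takes effect immediately, lost at reboot)
   echo zfs_vdev_max_pending/W0t5 | mdb -kw

   # or persistently, via /etc/system (picked up at next boot)
   set zfs:zfs_vdev_max_pending = 5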


Also - can I set this property per-vdev? (just in case I have sata and, 
say, a USP-V connected to the same box)?


Thanks again, and good to see you are still playing close by!

Cheers!

Nathan.



[zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-10 Thread Nathan
http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery

Is there a way, other than buying enterprise (RAID-specific) drives, to use 
normal drives in an array?

Does anyone have any success stories regarding a particular model?

The TLER cannot be edited on newer drives from Western Digital unfortunately.  
Are there some settings in ZFS that can be used to compensate for this?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-10 Thread Nathan
Sorry I probably didn't make myself exactly clear.

Basically drives without particular TLER settings drop out of RAID randomly.

* Error Recovery - This is called various things by various manufacturers 
(TLER, ERC, CCTL). In a Desktop drive, the goal is to do everything possible to 
recover the data. In an Enterprise, the goal is to ALWAYS 
return SOMETHING within the timeout period; if the data can't be recovered 
within that time, let the RAID controller reconstruct it. Wikipedia article.

Does this happen in ZFS?  Maybe it is particular to hardware RAID controllers.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2009-12-10 Thread Nathan
http://www.stringliterals.com/?p=77

This guy talks about it too under Hard Drives.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] White box server for OpenSolaris

2009-09-25 Thread Nathan
I am about to embark on building a home NAS box using OpenSolaris with 
ZFS.

Currently I have a chassis that will hold 16 hard drives, although not in 
caddies - downtime doesn't bother me if I need to switch a drive; I could 
probably do it running anyway, just a bit of a pain. :)

I am after suggestions of motherboard, CPU and RAM.  Basically I want ECC RAM 
and at least two PCI-E x4 channels, as I want to run 2 x AOC-USAS-L8i cards 
for 16 drives.

I want something with a bit of guts but not over the top.  I know the HCL is there 
but I want to see what other people are using in their solutions.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Procedure for Initial ZFS Replication to Remote Site by External HDD?

2009-08-12 Thread Nathan Hudson-Crim
What is the best way to use an external HDD for initial replication of a large 
ZFS filesystem?

System1 had filesystem; System2 needs to have a copy of filesystem.
Used send/recv on System1 to put filesystem@snap1 on a connected external HDD.
Exported external HDD pool and connected/imported on System2; then used 
send/recv to copy it to System2.

Incremental send/recv from System1 for @snap1 to @snap2 fails.

Clearly I have failed to take some measure of preparation. Osol isn't 
recognizing that the destination filesystem is/should be the same as the source 
@snap1.

I didn't find this issue on the forums but I will continue searching.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Procedure for Initial ZFS Replication to Remote Site by External HDD?

2009-08-12 Thread Nathan Hudson-Crim
I figured out what I did wrong. The filesystem as received on the external HDD 
had multiple snapshots, but I failed to check for them. So I had created a new 
snapshot in order to send/recv on System2, which meant System2 never shared 
@snap1 with System1. That doesn't work, obviously.

A new local send/recv of the filesystem's correct snapshot did the trick. Now 
System1 is replicating incrementals from San Diego to Seattle. Huzzah!
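
For anyone following along, the working sequence looks roughly like this - 
pool and dataset names are made up, and the last step assumes the usual 
ssh-based remote recv:

   # System1: seed the external pool with the snapshot itself
   zfs snapshot tank/fs@snap1
   zfs send tank/fs@snap1 | zfs recv extpool/fs

   # System2: receive from the imported external pool, keeping @snap1 intact
   zfs send extpool/fs@snap1 | zfs recv data/fs

   # System1 -> System2: incrementals now share @snap1 on both ends
   zfs snapshot tank/fs@snap2
   zfs send -i tank/fs@snap1 tank/fs@snap2 | ssh system2 zfs recv data/fs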
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding SATA cards for ZFS; was Lundman home NAS

2009-08-03 Thread Nathan Fiedler
I have not carried out any research into this area, but when I was
building my home server I wanted to use a Promise SATA-PCI card, but
alas (Open)Solaris has no support at all for the Promise chipsets.
Instead I used a rather old card based on the sil3124 chipset.

n


On Mon, Aug 3, 2009 at 9:35 AM, Neal Pollackneal.poll...@sun.com wrote:
 Let's take this first point; card that works with Solaris

 I might try to find some engineers to write device drivers to
 improve this situation.
 Would this alias be interested in teaching me which 3 or 4 cards they would
 put at the top of the wish list for Solaris support?
 I assume the current feature gap is defined as needing driver support
 for PCI-express add-in cards that have 4 to 8 ports, inexpensive
 JBOD, not expensive HW RAID, and can handle hot-swap while running the OS.
 Would this be correct?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] ZFS and deduplication

2009-08-03 Thread Nathan Hudson-Crim
 On Sun, 02 Aug 2009 15:26:12 -0700 (PDT)
 Andre Lue no-re...@opensolaris.org wrote:
 
 Was de-duplication slated for snv_119? 
 
 No.
 
  If not, can anyone say which snv_xxx and in which form we will
  see it (synchronous, asynchronous, or both)?
 
 No, and no.
 
 Sorry,
 James

Andre, I've seen this before. What you have to do is ask James each question 3 
times and on the third time he will tell the truth. ;)

I know it's not in the preview of 2010.2 (build 118).

On a serious note, James - do you know the status of the presentation recording 
on ZFS deduplication?

Many thanks,
Nathan
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] ZFS and deduplication

2009-08-03 Thread Nathan Hudson-Crim
I was absolutely not impugning you in any way but rather trying to lighten the 
mood with a little Austin Powers humor. I'll refrain from this in the future.

And quite to the contrary of any negative feelings, I am pleased and grateful 
that you are participating in this conversation.

Regards,
Nathan
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lundman home NAS

2009-07-31 Thread Nathan Fiedler
Yes, please write more about this. The photos are terrific and I
appreciate the many useful observations you've made. For my home NAS I
chose the Chenbro ES34069 and the biggest problem was finding a
SATA/PCI card that would work with OpenSolaris and fit in the case
(technically impossible without a ribbon cable PCI adapter). After
seeing this, I may reconsider my choice.

For the SATA card, you mentioned that it was a close fit with the case
power switch. Would removing the backplane on the card have helped?

Thanks

n


On Fri, Jul 31, 2009 at 5:22 AM, Jorgen Lundmanlund...@gmo.jp wrote:
 I have assembled my home RAID finally, and I think it looks rather good.

 http://www.lundman.net/gallery/v/lraid5/p1150547.jpg.html

 Feedback is welcome.

 I have yet to do proper speed tests, I will do so in the coming week should
 people be interested.

 Even though I have tried to use only existing, and cheap, parts the end sum
 became higher than I expected. Final price is somewhere in the 47,000 yen
 range. (Without hard disks)

 If I were to make and sell these, they would be 57,000 or so, so I do not
 really know if anyone would be interested. Especially since SOHO NAS devices
 seem to start around 80,000.

 Anyway, sure has been fun.

 Lund
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication

2009-07-30 Thread Nathan Hudson-Crim
I'll maintain hope for seeing/hearing the presentation until you guys announce 
that you had NASA store the tape for safe-keeping.

Bump'd.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Motherboard for home zfs/solaris file server

2009-07-21 Thread Nathan Fiedler
Regarding the SATA card and the mainboard slots, make sure that
whatever you get is compatible with the OS. In my case I chose
OpenSolaris which lacks support for Promise SATA cards. As a result,
my choices were very limited since I had chosen a Chenbro ES34069 case
and Intel Little Falls 2 mainboard. Basically I had to go with the
SYBA Sil3124 card and a flexible PCI adapter. More details here:
http://cafenate.wordpress.com/2009/07/13/building-a-nas-box/

No ECC memory, but I don't mind because the case has a great form
factor and hot swappable drive bays. If I could find a low power board
that supported ECC and OpenSolaris, I'd consider switching.

Good luck.

n
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Honesty after a power failure

2009-03-24 Thread Nathan Kroenert

Hey, Dennis -

I can't help but wonder if the failure is a result of zfs itself finding 
some problems post restart...


Is there anything in your FMA logs?

  fmstat

for a summary and

  fmdump

for a summary of the related errors

eg:
drteeth:/tmp # fmdump
TIME UUID SUNW-MSG-ID
Nov 03 13:57:29.4190 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 ZFS-8000-D3
Nov 03 13:57:29.9921 916ce3e2-0c5c-e335-d317-ba1e8a93742e ZFS-8000-D3
Nov 03 14:04:58.8973 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d ZFS-8000-CS
Mar 05 18:04:40.7116 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-4M 
Repaired
Mar 05 18:04:40.7875 ff2f60f8-2906-676a-bfb7-ccbd9c7f957d FMD-8000-6U 
Resolved
Mar 05 18:04:41.0052 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-4M 
Repaired
Mar 05 18:04:41.0760 e28210d7-b7aa-42e0-a3e8-9ba21332d1c7 FMD-8000-6U 
Resolved


then for example,

  fmdump -vu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7

and

  fmdump -Vvu e28210d7-b7aa-42e0-a3e8-9ba21332d1c7

will show more and more information about the error. Note that some of 
it might seem like rubbish. The important bits should be obvious though 
- things like what the SUNW message ID is (like ZFS-8000-D3), which can be 
pumped into


  sun.com/msg

to see what exactly it's going on about.

Note also that there should be something interesting in the 
/var/adm/messages log to match any 'faulted' devices.


You might also find an

  fmdump -e

and

  fmdump -eV

to be interesting - This is the *error* log as opposed to the *fault* 
log. (Every 'thing that goes wrong' is an error, only those that are 
diagnosed are considered a fault.)


Note that in all of these fm[dump|stat] commands, you are really only 
looking at the two sets of data. The errors - that is the telemetry 
incoming to FMA - and the faults. If you include a -e, you view the 
errors, otherwise, you are looking at the faults.


By the way - sun.com/msg has a great PDF on it about the predictive self 
healing technologies in Solaris 10 and will offer more interesting 
information.


Would be interesting to see *why* ZFS / FMA is feeling the need to fault 
your devices.


I was interested to see on one of my boxes that I have actually had a 
*lot* of errors, which I'm now going to have to investigate... Looks 
like I have a dud rocket in my system... :)


Oh - And I saw this:

Nov 03 14:04:31.2783 ereport.fs.zfs.checksum

Score one more for ZFS! This box has a measly 300GB mirrored, and I have 
already seen dud data. (heh... It's also got non-ecc memory... ;)


Cheers!

Nathan.


Dennis Clarke wrote:

On Tue, 24 Mar 2009, Dennis Clarke wrote:

You would think so eh?
But a transient problem that only occurs after a power failure?

Transient problems are most common after a power failure or during
initialization.


Well the issue here is that power was on for ten minutes before I tried
to do a boot from the ok pronpt.

Regardless, the point is that the ZPool shows no faults at boot time and
then shows phantom faults *after* I go to init 3.

That does seem odd.

Dennis


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reboot when copying large amounts of data

2009-03-12 Thread Nathan Kroenert

definitely time to bust out some mdb -k and see what it's moaning about.
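
A couple of harmless starting points from a root shell, if it helps - just a 
sketch, nothing exhaustive:

   # dump the kernel message buffer - often shows what was being logged
   # just before the reset
   echo "::msgbuf" | mdb -k

   # rough breakdown of where physical memory is going
   echo "::memstat" | mdb -k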

I did not see the screenshot earlier... sorry about that.

Nathan.

Blake wrote:

I start the cp, and then, with prstat -a, watch the cpu load for the
cp process climb to 25% on a 4-core machine.

Load, measured for example with 'uptime', climbs steadily until the reboot.

Note that the machine does not dump properly, panic or hang - rather,
it reboots.

I attached a screenshot earlier in this thread of the little bit of
error message I could see on the console.  The machine is trying to
dump to the dump zvol, but fails to do so.  Only sometimes do I see an
error on the machine's local console - most times, it simply reboots.



On Thu, Mar 12, 2009 at 1:55 AM, Nathan Kroenert
nathan.kroen...@sun.com wrote:

Hm -

Crashes, or hangs? Moreover - how do you know a CPU is pegged?

Seems like we could do a little more discovery on what the actual problem
here is, as I can read it about 4 different ways.

By this last piece of information, I'm guessing the system does not crash,
but goes really really slow??

Crash == panic == we see stack dump on console and try to take a dump
hang == nothing works == no response - might be worth looking at mdb -K
   or booting with a -k on the boot line.

So - are we crashing, hanging, or something different?

It might simply be that you are eating up all your memory, and your physical
backing storage is taking a while to catch up?

Nathan.

Blake wrote:

My dump device is already on a different controller - the motherboard's
built-in nVidia SATA controller.

The raidz2 vdev is the one I'm having trouble with (copying the same
files to the mirrored rpool on the nVidia controller works nicely).  I
do notice that, when using cp to copy the files to the raidz2 pool,
load on the machine climbs steadily until the crash, and one proc core
pegs at 100%.

Frustrating, yes.

On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J
maidakalexand...@johndeere.com wrote:

If you're having issues with a disk controller or disk IO driver it's
highly likely that a savecore to disk after the panic will fail.  I'm not
sure how to work around this, maybe a dedicated dump device not on a
controller that uses a different driver than the one that you're having
issues with?

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Blake
Sent: Wednesday, March 11, 2009 4:45 PM
To: Richard Elling
Cc: Marc Bevand; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] reboot when copying large amounts of data

I guess I didn't make it clear that I had already tried using savecore to
retrieve the core from the dump device.

I added a larger zvol for dump, to make sure that I wasn't running out of
space on the dump device:

r...@host:~# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated)
Savecore directory: /var/crash/host
  Savecore enabled: yes

I was using the -L option only to try to get some idea of why the system
load was climbing to 1 during a simple file copy.



On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling
richard.ell...@gmail.com wrote:

Blake wrote:

I'm attaching a screenshot of the console just before reboot.  The
dump doesn't seem to be working, or savecore isn't working.

On Wed, Mar 11, 2009 at 11:33 AM, Blake blake.ir...@gmail.com wrote:


I'm working on testing this some more by doing a savecore -L right
after I start the copy.



savecore -L is not what you want.

By default, for OpenSolaris, savecore on boot is disabled.  But the
core will have been dumped into the dump slice, which is not used for
swap.
So you should be able to run savecore at a later time to collect the
core from the last dump.
-- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//



--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia

Re: [zfs-discuss] reboot when copying large amounts of data

2009-03-12 Thread Nathan Kroenert
definitely time to bust out some mdb -K or boot -k and see what it's 
moaning about.


I did not see the screenshot earlier... sorry about that.

Nathan.

Blake wrote:

I start the cp, and then, with prstat -a, watch the cpu load for the
cp process climb to 25% on a 4-core machine.

Load, measured for example with 'uptime', climbs steadily until the reboot.

Note that the machine does not dump properly, panic or hang - rather,
it reboots.

I attached a screenshot earlier in this thread of the little bit of
error message I could see on the console.  The machine is trying to
dump to the dump zvol, but fails to do so.  Only sometimes do I see an
error on the machine's local console - most times, it simply reboots.



On Thu, Mar 12, 2009 at 1:55 AM, Nathan Kroenert
nathan.kroen...@sun.com wrote:

Hm -

Crashes, or hangs? Moreover - how do you know a CPU is pegged?

Seems like we could do a little more discovery on what the actual problem
here is, as I can read it about 4 different ways.

By this last piece of information, I'm guessing the system does not crash,
but goes really really slow??

Crash == panic == we see stack dump on console and try to take a dump
hang == nothing works == no response - might be worth looking at mdb -K
   or booting with a -k on the boot line.

So - are we crashing, hanging, or something different?

It might simply be that you are eating up all your memory, and your physical
backing storage is taking a while to catch up?

Nathan.

Blake wrote:

My dump device is already on a different controller - the motherboard's
built-in nVidia SATA controller.

The raidz2 vdev is the one I'm having trouble with (copying the same
files to the mirrored rpool on the nVidia controller works nicely).  I
do notice that, when using cp to copy the files to the raidz2 pool,
load on the machine climbs steadily until the crash, and one proc core
pegs at 100%.

Frustrating, yes.

On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J
maidakalexand...@johndeere.com wrote:

If you're having issues with a disk controller or disk IO driver it's
highly likely that a savecore to disk after the panic will fail.  I'm not
sure how to work around this, maybe a dedicated dump device not on a
controller that uses a different driver than the one that you're having
issues with?

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Blake
Sent: Wednesday, March 11, 2009 4:45 PM
To: Richard Elling
Cc: Marc Bevand; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] reboot when copying large amounts of data

I guess I didn't make it clear that I had already tried using savecore to
retrieve the core from the dump device.

I added a larger zvol for dump, to make sure that I wasn't running out of
space on the dump device:

r...@host:~# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated)
Savecore directory: /var/crash/host
  Savecore enabled: yes

I was using the -L option only to try to get some idea of why the system
load was climbing to 1 during a simple file copy.



On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling
richard.ell...@gmail.com wrote:

Blake wrote:

I'm attaching a screenshot of the console just before reboot.  The
dump doesn't seem to be working, or savecore isn't working.

On Wed, Mar 11, 2009 at 11:33 AM, Blake blake.ir...@gmail.com wrote:


I'm working on testing this some more by doing a savecore -L right
after I start the copy.



savecore -L is not what you want.

By default, for OpenSolaris, savecore on boot is disabled.  But the
core will have been dumped into the dump slice, which is not used for
swap.
So you should be able to run savecore at a later time to collect the
core from the last dump.
-- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//



--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia

Re: [zfs-discuss] reboot when copying large amounts of data

2009-03-12 Thread Nathan Kroenert
For what it's worth, I have been running Nevada (so, same kernel as 
opensolaris) for ages (at least 18 months) on a Gigabyte board with the 
MCP55 chipset and it's been flawless.


I liked it so much, I bought its newer brother, based on the nVidia 
750SLI chipset...   M750SLI-DS4


Cheers!

Nathan.


On 13/03/09 09:21 AM, Dave wrote:



Tim wrote:



On Thu, Mar 12, 2009 at 2:22 PM, Blake blake.ir...@gmail.com wrote:


I've managed to get the data transfer to work by rearranging my disks
so that all of them sit on the integrated SATA controller.

So, I feel pretty certain that this is either an issue with the
Supermicro aoc-sat2-mv8 card, or with PCI-X on the motherboard (though
I would think that the integrated SATA would also be using the PCI
bus?).

The motherboard, for those interested, is an H8DME-2 (not, I now find
after buying this box from Silicon Mechanics, a board that's on the
Solaris HCL...)


http://www.supermicro.com/Aplus/motherboard/Opteron2000/MCP55/h8dme-2.cfm 



So I'm now considering one of LSI's HBAs - what do list members think
about this device:

http://www.provantage.com/lsi-logic-lsi00117~7LSIG03X.htm



I believe the MCP55's SATA controllers are actually PCI-E based.


I use Tyan 2927 motherboards. They have on-board nVidia MCP55 chipsets, 
which is the same chipset as the X4500 (IIRC). I wouldn't trust the 
MCP55 chipset in OpenSolaris. I had random disk hangs even while the 
machine was mostly idle.


In Feb 2008 I bought AOC-SAT2-MV8 cards and moved all my drives to these 
add-in cards. I haven't had any issues with drive hanging since. There 
does not seem to be any problems with the SAT2-MV8 under heavy load in 
my servers from what I've seen.


When the SuperMicro AOC-USAS-L8i came out later last year, I started 
using them instead. They work better than the SAT2-MV8s.


This card needs a 3U or bigger case:
http://www.supermicro.com/products/accessories/addon/AOC-USAS-L8i.cfm

This is the low profile card that will fit in a 2U:
http://www.supermicro.com/products/accessories/addon/AOC-USASLP-L8i.cfm

They both work in normal PCI-E slots on my Tyan 2927 mobos.

Finding good non-Sun hardware that works very well under OpenSolaris is 
frustrating to say the least. Good luck.


--
Dave
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--


//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] reboot when copying large amounts of data

2009-03-11 Thread Nathan Kroenert

Hm -

Crashes, or hangs? Moreover - how do you know a CPU is pegged?

Seems like we could do a little more discovery on what the actual 
problem here is, as I can read it about 4 different ways.


By this last piece of information, I'm guessing the system does not 
crash, but goes really really slow??


Crash == panic == we see stack dump on console and try to take a dump
hang == nothing works == no response - might be worth looking at mdb -K
or booting with a -k on the boot line.

So - are we crashing, hanging, or something different?

It might simply be that you are eating up all your memory, and your 
physical backing storage is taking a while to catch up?


Nathan.

Blake wrote:

My dump device is already on a different controller - the motherboard's
built-in nVidia SATA controller.

The raidz2 vdev is the one I'm having trouble with (copying the same
files to the mirrored rpool on the nVidia controller works nicely).  I
do notice that, when using cp to copy the files to the raidz2 pool,
load on the machine climbs steadily until the crash, and one proc core
pegs at 100%.

Frustrating, yes.

On Thu, Mar 12, 2009 at 12:31 AM, Maidak Alexander J
maidakalexand...@johndeere.com wrote:

If you're having issues with a disk controller or disk IO driver it's highly 
likely that a savecore to disk after the panic will fail.  I'm not sure how to 
work around this, maybe a dedicated dump device not on a controller that uses a 
different driver than the one that you're having issues with?

-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Blake
Sent: Wednesday, March 11, 2009 4:45 PM
To: Richard Elling
Cc: Marc Bevand; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] reboot when copying large amounts of data

I guess I didn't make it clear that I had already tried using savecore to 
retrieve the core from the dump device.

I added a larger zvol for dump, to make sure that I wasn't running out of space 
on the dump device:

r...@host:~# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/bigdump (dedicated)
Savecore directory: /var/crash/host
  Savecore enabled: yes

I was using the -L option only to try to get some idea of why the system load 
was climbing to 1 during a simple file copy.



On Wed, Mar 11, 2009 at 4:58 PM, Richard Elling richard.ell...@gmail.com 
wrote:

Blake wrote:

I'm attaching a screenshot of the console just before reboot.  The
dump doesn't seem to be working, or savecore isn't working.

On Wed, Mar 11, 2009 at 11:33 AM, Blake blake.ir...@gmail.com wrote:


I'm working on testing this some more by doing a savecore -L right
after I start the copy.



savecore -L is not what you want.

By default, for OpenSolaris, savecore on boot is disabled.  But the
core will have been dumped into the dump slice, which is not used for swap.
So you should be able to run savecore at a later time to collect the
core from the last dump.
-- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] schedulers [was: zfs related google summer of code ideas - your vote]

2009-03-04 Thread Nathan Kroenert

Hm - a ZilArc??

Or, slarc?

Or L2ArZi

I tried something sort of similar to this when fooling around, adding 
different *slices* for ZIL / L2ARC, but as I'm too poor to afford good 
SSDs my result was poor at best... ;)
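
For the record, the slice experiment was nothing fancier than something like 
this (device names made up):

   # one SSD, two slices: s0 as a separate log (slog), s1 as L2ARC
   zpool add tank log c3t0d0s0
   zpool add tank cache c3t0d0s1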


Having ZFS manage some 'arbitrary fast stuff' and sorting out its own 
ZIL and L2ARC would be interesting, though, given the propensity for 
SSDs to be either fast read or fast write at the moment, you may well 
require some whacky knobs to get it to do what you actually want it to...


hm.

Nathan.

Bill Sommerfeld wrote:

On Wed, 2009-03-04 at 12:49 -0800, Richard Elling wrote:

But I'm curious as to why you would want to put both the slog and
L2ARC on the same SSD?


Reducing part count in a small system.

For instance: adding L2ARC+slog to a laptop.  I might only have one slot
free to allocate to ssd. 


IMHO the right administrative interface for this is for zpool to allow
you to add the same device to a pool as both cache and log, and let zfs
figure out how to not step on itself when allocating blocks.

- Bill

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
///
// Nathan Kroenert  nathan.kroen...@sun.com  //
// Senior Systems Engineer  Phone:+61 3 9869 6255//
// Global Systems Engineering   Fax:+61 3 9869 6288  //
// Level 7, 476 St. Kilda Road   //
// Melbourne 3004   VictoriaAustralia//
///



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] destroy means destroy, right?

2009-01-29 Thread Nathan Kroenert
For years, we resisted stopping rm -r / because people should know 
better, until *finally* someone said - you know what - that's just dumb.

Then, just like that, it was fixed.

Yes - This is Unix.

Yes - Provide the gun and allow the user to point it.

Just don't let it go off in their groin or when pointed at their foot, 
or provide at least some protection when they do.

Having even a limited amount of restore capability will provide the user 
with steel capped boots and a codpiece. It won't protect them from 
herpes or fungus but it might deflect the bullet.

On 01/30/09 08:19, Jacob Ritorto wrote:
 I like that, although it's a bit of an intelligence insulter.  Reminds
 me of the old pdp11 install (
 http://charles.the-haleys.org/papers/setting_up_unix_V7.pdf ) --
 
 This step makes an empty file system.
 6.The next thing to do is to restore the data onto the new empty
 file system. To do this you respond
   to the ':' printed in the last step with
 (bring in the program restor)
 : tm(0,4)  ('ht(0,4)' for TU16/TE16)
 tape? tm(0,5)  (use 'ht(0,5)' for TU16/TE16)
 disk? rp(0,0)(use 'hp(0,0)' for RP04/5/6)
 Last chance before scribbling on disk. (you type return)
 (the tape moves, perhaps 5-10 minutes pass)
 end of tape
 Boot
 :
   You now have a UNIX root file system.
 
 
 
 
 On Thu, Jan 29, 2009 at 3:42 PM, Orvar Korvar
 knatte_fnatte_tja...@yahoo.com wrote:
 Maybe add a timer or something? When doing a destroy, ZFS will keep 
 everything for 1 minute or so, before overwriting. This way the disk won't 
 get as fragmented. And if you had fat fingers and typed wrong, you have up 
 to one minute to undo. That will catch 80% of the mistakes?
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 


//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New RAM disk from ACARD might be interesting

2009-01-29 Thread Nathan Kroenert
As it presents as standard SATA, there should be no reason for this not 
to work...

It has battery backup, and CF for backup / restore from DDR2 in the 
event of power loss... Pretty cool. (Would have preferred a super-cap, 
but oh, well... ;)

Should make an excellent ZIL *and* L2ARC style device...

Seems a little pricey for what it is though.

It's going onto my list of what I'd buy if I had the money... ;)

Nathan.

On 01/30/09 12:10, Janåke Rönnblom wrote:
 ACARD have launched a new RAM disk which can take up to 64 GB of ECC RAM 
 while still looking like a standard SATA drive. If anyone remember the 
 Gigabyte I-RAM this might be a new development in this area.
 
 Its called ACARD ANS-9010 and up...
 
 http://www.acard.com.tw/english/fb01-product.jsp?idno_no=270&prod_no=ANS-9010&type1_title=%20Solid%20State%20Drive&type1_idno=13
 
 This might be interesting to use as a cheap log instead of SSD cards... This 
 test compares it with both Intel SSD (consumer and pro):
 
 http://www.techreport.com/articles.x/16255/1
 
 However the test is more from a homeuser point of view...
 
 Anyone got the money and time to test it ;)
 
 -J

-- 


//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New RAM disk from ACARD might be interesting

2009-01-29 Thread Nathan Kroenert
You could be the first...

Man up! ;)

Nathan.

Will Murnane wrote:
 On Thu, Jan 29, 2009 at 21:11, Nathan Kroenert nathan.kroen...@sun.com 
 wrote:
 Seems a little pricey for what it is though.
 For what it's worth, there's also a 9010B model that has only one sata
 port and room for six dimms instead of eight at $250 instead of $400.
 That might fit in your budget a little easier...  I'm considering one
 for a log device.  I wish someone else could test it first and report
 problems, but someone's gotta take the jump first.
 
 It looks like this device (the 9010, that is) is also being marketed
 as the HyperDrive V at the same price point.
 
 Will
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] destroy means destroy, right?

2009-01-28 Thread Nathan Kroenert
I'm no authority, but I believe it's gone.

Some of the others on the list might have some funky thoughts, but I 
would suggest that if you have already done any other I/Os to the disk, 
you have likely rolled past the point of no return.

Anyone else care to comment?

As a side note, I had a look for anything that looked like a CR for zfs 
destroy / undestroy and could not find one.

Anyone interested in me submitting an RFE to have something like a

zfs undestroy pool/fs

capability?

Clearly, there would be limitations in how long you would have to get 
the command to work, but it would have its merits...

Cheers!

Nathan.

Jacob Ritorto wrote:
 Hi,
 I just said zfs destroy pool/fs, but meant to say zfs destroy
 pool/junk.  Is 'fs' really gone?
 
 thx
 jake
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is Disabling ARC on SolarisU4 possible?

2009-01-28 Thread Nathan Kroenert
Also - My experience with a very small ARC is that your performance will 
stink. ZFS is an advanced filesystem that IMO makes some assumptions 
about the capability and capacity of current hardware. If you don't give 
it what it's expecting, your results may be equally unexpected.

If you are keen to test the *actual* disk performance, you should just 
use the underlying disk device like /dev/rdsk/c0t0d0s0

Beware, however, that any writes to these devices will indeed result in 
the loss of the data on those devices, zpools or other.
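
Something as simple as this gives a feel for raw sequential read speed off a 
slice - the device name is only an example, and sticking to reads keeps it safe:

   # read 1GB straight off the raw slice, bypassing ZFS and the ARC entirely
   dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=1024k count=1024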

Cheers.

Nathan.

Richard Elling wrote:
 Rob Brown wrote:
 Afternoon,

 In order to test my storage I want to stop the caching effect of the 
 ARC on a ZFS filesystem. I can do similar on UFS by mounting it with 
 the directio flag.
 
 No, not really the same concept, which is why Roch wrote
 http://blogs.sun.com/roch/entry/zfs_and_directio
 
 I saw the following two options on a nevada box which presumably 
 control it:

 primarycache
 secondarycache
 
 Yes, to some degree this offers some capability. But I don't believe
 they are in any release of Solaris 10.
 -- richard
 
 But I’m running Solaris 10U4 which doesn’t have them - can I disable it?

 Many thanks

 Rob




 | Robert Brown - ioko Professional Services |
 | Mobile: +44 (0)7769 711 885 |
 

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] destroy means destroy, right?

2009-01-28 Thread Nathan Kroenert
He's not trying to recover a pool - Just a filesystem...

:)

bdebel...@intelesyscorp.com wrote:
 Recovering Destroyed ZFS Storage Pools.
 You can use the zpool import -D command to recover a storage pool that has 
 been destroyed.
 http://docs.sun.com/app/docs/doc/819-5461/gcfhw?a=view

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cifs perfomance

2009-01-22 Thread Nathan Kroenert
Are you able to qualify that a little?

I'm using a realtek interface with OpenSolaris and am yet to experience 
any issues.

Nathan.

Brandon High wrote:
 On Wed, Jan 21, 2009 at 5:40 PM, Bob Friesenhahn
 bfrie...@simple.dallas.tx.us wrote:
 Several people reported this same problem.  They changed their
 ethernet adaptor to an Intel ethernet interface and the performance
 problem went away.  It was not ZFS's fault.
 
 It may not be a ZFS problem, but it is an OpenSolaris problem. The
 drivers for Realtek hardware and other NICs are ... not so great.
 
 -B
 

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cifs perfomance

2009-01-22 Thread Nathan Kroenert
Interesting. I'll have a poke...

Thanks!

Nathan.

Brandon High wrote:
 On Thu, Jan 22, 2009 at 1:29 PM, Nathan Kroenert
 nathan.kroen...@sun.com wrote:
 Are you able to qualify that a little?

 I'm using a realtek interface with OpenSolaris and am yet to experience any
 issues.
 
 There's a lot of anecdotal evidence that replacing the rge driver with
 the gani driver can fix poor NFS and CIFS performance. Another option
 is to use an Intel NIC in place of the Realtek.
 
 Search the archives for gani or slow CIFS and you'll find several
 people who resolved poor performance by getting rid of the rge driver.
 
 While it's not hard evidence, it seems to indicate that there are
 problems with the driver (and most likely the hardware).
 
 -B
 

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hot spare not so hot ??

2009-01-20 Thread Nathan Kroenert
An interesting interpretation of using hot spares.

Could it be that the hot-spare code only fires if the disk goes down 
whilst the pool is active?

hm.
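
In the meantime, you can always kick the spare in by hand - using the device 
names from your status output, something like:

   zpool replace rpool c0d1s0 c1d1s0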

Nathan.

Scot Ballard wrote:
 I have configured a test system with a mirrored rpool and one hot spare. 
  I powered the systems off, pulled one of the disks from rpool to 
 simulate a hardware failure. 
 
 The hot spare is not activating automatically.  Is there something more 
 i should have done to make this work ? 
 
 
   pool: rpool
  state: DEGRADED
 status: One or more devices could not be opened.  Sufficient replicas 
 exist for
 the pool to continue functioning in a degraded state.
 action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
  scrub: none requested
 config:
 
 NAME        STATE     READ WRITE CKSUM
 rpool       DEGRADED     0     0     0
   mirror    DEGRADED     0     0     0
     c0d0s0  ONLINE       0     0     0
     c0d1s0  UNAVAIL      0     0     0  cannot open
 spares
   c1d1s0    AVAIL
 
 errors: No known data errors
 
 
 
 Thanks
 
 
   -Scot
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS tale of woe and fail

2009-01-18 Thread Nathan Kroenert
Hey, Tom -

Correct me if I'm wrong here, but it seems you are not allowing ZFS any 
sort of redundancy to manage.

I'm not sure how you can class it a ZFS fail when the Disk subsystem has 
failed...

Or - did I miss something? :)

Nathan.

Tom Bird wrote:
 Morning,
 
 For those of you who remember last time, this is a different Solaris,
 different disk box and different host, but the epic nature of the fail
 is similar.
 
 The RAID box that is the 63T LUN has a hardware fault and has been
 crashing, up to now the box and host got restarted and both came up
 fine.  However, just now as I have got replacement hardware in position
 and was ready to start copying, it went bang and my data has all gone.
 
 Ideas?
 
 
 r...@cs4:~# zpool list
 NAME      SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
 content  62.5T  59.9T  2.63T  95%  ONLINE  -
 
 r...@cs4:~# zpool status -v
   pool: content
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: none requested
 config:
 
 NAME      STATE     READ WRITE CKSUM
 content   ONLINE       0     0    32
   c2t8d0  ONLINE       0     0    32
 
 errors: Permanent errors have been detected in the following files:
 
 content:0x0
 content:0x2c898
 
 r...@cs4:~# find /content
 /content
 r...@cs4:~# (yes that really is it)
 
 r...@cs4:~# uname -a
 SunOS cs4.kw 5.11 snv_99 sun4v sparc SUNW,Sun-Fire-T200
 
 from format:
2. c2t8d0 IFT-S12S-G1033-363H-62.76TB
   /p...@7c0/p...@0/p...@8/LSILogic,s...@0/s...@8,0
 
 Also, content does not show in df output.
 
 thanks

-- 
///
// Nathan Kroenert  nathan.kroen...@sun.com  //
// Senior Systems Engineer  Phone:+61 3 9869 6255//
// Global Systems Engineering   Fax:+61 3 9869 6288  //
// Level 7, 476 St. Kilda Road   //
// Melbourne 3004   VictoriaAustralia//
///



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Odd network performance with ZFS/CIFS

2009-01-13 Thread Nathan Kroenert
2C from Oz:

Windows (at least XP - I have thus far been lucky enough to avoid 
running vista on metal) has packet schedulers, quality of service 
settings and other crap that can severely impact windows performance on 
the network.

I have found that setting the following made a difference to me:
  - Disable Jumbo Frames (as I have only a very cheap crappy gig-switch 
and if I try to drive it hard with jumbo's enabled, it falls in a heap)
  - Lose the 'deterministic network enhancer' under windows
  - Lose the QoS packet scheduler
  - Check the interface properties and go looking for something that 
sounds like 'optimize for CPU / optimize for speed' and set it to speed

  - Depending on workload and packet sizes, it might also be worth 
looking at disabling the Nagle algorithm on the Solaris box (one-liner below).
See http://www.sun.com/servers/coolthreads/tnb/lighttpd.jsp for a quick 
explanation...
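
On Solaris that's typically just a one-liner - worth treating as an 
experiment rather than a permanent setting:

   # disable Nagle by dropping the default limit to 1 byte (reverts at reboot)
   ndd -set /dev/tcp tcp_naglim_def 1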

It would be interesting to see if you see the same issues using a 
Solaris or other OS client.

Hope this helps somewhat. Let us know how it goes.

Nathan.

fredrick phol wrote:
 I'm currently experiencing exactly the same problem and it's been driving me 
 nuts. Tried OpenSolaris and am currently running the latest version of SXCE, 
 both with exactly the same results.
 
 This issue occurs with both CIFS which shows the speed degrade and ISCSI 
 which just starts off at the lowest speed but exhibits the same peaks and 
 troughs
 
 I have 4x500GB drives in RAIDz1 config on an AMD 780G mobo.
 
 speed tests using DD have shown read rates of ~140MB/s and write rates of 
 ~120MB/s (humourously slightly faster than one of my friends' arrays on linux 
 and intel hardware) 
 
 Currently the transfer will sit at about 18% gige network utilisation for 10 
 seconds, then dip to 0 and come straight back up to 18%. This happens at 
 regular, predictable intervals - there is no randomness. I've tried two 
 different switches, one a consumer grade switch from Linksys and one a low 
 end distribution switch from 3Com; both exhibit exactly the same behaviour.
 
 The only computer accessing the Solaris box is a Windows Vista 64 SP1 machine.
 
 Currently I'm guessing that the transfer issues have something to do with the 
 onboard Realtek network card in the Solaris box. Possibly a driver issue? 
 I've got a dual port intel server nic on order to replace it and test with.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the new consumer NAS devices run OpenSolaris?

2009-01-12 Thread Nathan Kroenert
Meh -

I doubt you hurt anyone. Most people have kill files for that sort of 
stuff. heh. ;)

On the 'which of these should work' sort of question, if you do happen 
to try any of those systems, and they work, remember to submit the 
details to the HCL. :)

I'm keen to give it a whack on a small box myself, but have not had the 
time or the funds. The Atom stuff should work pretty well, and even with 
  2GB of memory, if it's just acting as a NAS server, it should have 
plenty of poke. (assuming you are only using it for NAS... ;)

Oh - and assuming you don't enable stuff like gzip-9 compression, which 
might, on the slower Atom style chips, get in the way.

Looking forward to any reports.

Nathan.

On 13/01/09 01:47 PM, JZ wrote:
 ok, was I too harsh on the list?
 sorry folks, as I said, I have the biggest ego.
 
 no one can hurt that by trying to fight me, but yes, it can be hurt if I 
 have to hurt the friends I love in protecting my ego or my other friends' 
 ego.
 
 but no one can get hurt if we don't claim what we have or what we know is 
 the best of all.
 
 a contribution to help the problem today can be better than 100% 
 strategically correct in the long run.
 
 we use what we have today, but if that usage will impact the life or death 
 of a promising technology branch, as a living thing, maybe we don't want to 
 use the best of today.
 
 everyone has their own need and want, and there is no better/worse, 
 right/wrong in the choice of technology.
 
 but some technologies can work together in a constructive fashion, and some 
 in a destructive fashion.
 
 please, be constructive.
 and you will hear much less from me.
 
 best,
 z 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 


//
// Nathan Kroenert  nathan.kroen...@sun.com //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need Help Recovering Zpool

2008-12-15 Thread Nathan Hand
I have moved the zpool image file to an OpenSolaris machine running 101b.

r...@opensolaris:~# uname -a
SunOS opensolaris 5.11 snv_101b i86pc i386 i86pc Solaris

Here I am able to attempt an import of the pool and at least the OS does not 
panic.

r...@opensolaris:~# zpool import -d /mnt
  pool: zones
id: 17407806223688303760
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

zones   ONLINE
  /mnt/zpool.zones  ONLINE

But it hangs forever when I actually attempt the import.

r...@opensolaris:~# zpool import -d /mnt -f zones
never returns

The thread associated with the import is stuck on txg_wait_synced.

r...@opensolaris:~# echo "0t757::pid2proc | ::walk thread | ::findstack -v" | mdb -k
stack pointer for thread d6dcc800: d51bdc44
  d51bdc74 swtch+0x195()
  d51bdc84 cv_wait+0x53(d62ef1e6, d62ef1a8, d51bdcc4, fa15f9e1)
  d51bdcc4 txg_wait_synced+0x90(d62ef040, 0, 0, 2)
  d51bdd34 spa_load+0xd0b(d6c1f080, da5dccd8, 2, 1)
  d51bdd84 spa_import_common+0xbd()
  d51bddb4 spa_import+0x18(d6c8f000, da5dccd8, 0, fa187dac)
  d51bdde4 zfs_ioc_pool_import+0xcd(d6c8f000, 0, 0)
  d51bde14 zfsdev_ioctl+0xe0()
  d51bde44 cdev_ioctl+0x31(2d8, 5a02, 8042450, 13, da532b28, d51bdf00)
  d51bde74 spec_ioctl+0x6b(d6dbfc80, 5a02, 8042450, 13, da532b28, d51bdf00)
  d51bdec4 fop_ioctl+0x49(d6dbfc80, 5a02, 8042450, 13, da532b28, d51bdf00)
  d51bdf84 ioctl+0x171()
  d51bdfac sys_call+0x10c()

There is a corresponding thread stuck on zio_wait.

d5e50de0 fec1dad80   0  60 d5cb76c8
  PC: _resume_from_idle+0xb1THREAD: txg_sync_thread()
  stack pointer for thread d5e50de0: d5e50a58
swtch+0x195()
cv_wait+0x53()
zio_wait+0x55()
dbuf_read+0x1fd()
dbuf_will_dirty+0x30()
dmu_write+0xd7()
space_map_sync+0x304()
metaslab_sync+0x284()
vdev_sync+0xc6()
spa_sync+0x35c()
txg_sync_thread+0x295()
thread_start+8()

I see from another discussion on zfs-discuss that Victor Latushkin helped Erik 
Gulliksson recover from a similar situation by using a specially patched zfs 
module. Would it be possible for me to get that same module?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need Help Invalidating Uberblock

2008-12-15 Thread Nathan Hand
I don't know if this is relevant or merely a coincidence but the zdb command 
fails an assertion in the same txg_wait_synced function.

r...@opensolaris:~# zdb -p /mnt -e zones 
Assertion failed: tx->tx_threads == 2, file ../../../uts/common/fs/zfs/txg.c, 
line 423, function txg_wait_synced
Abort (core dumped)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need Help Recovering Zpool

2008-12-15 Thread Nathan Hand
Thanks for the reply. I tried the following:

$ zpool import -o failmode=continue -d /mnt -f zones

But the situation did not improve. It still hangs on the import.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need Help Invalidating Uberblock

2008-12-15 Thread Nathan Hand
I've had some success.

I started with the ZFS on-disk format PDF.

http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

The uberblocks all have magic value 0x00bab10c. I used od -x to find that value 
in the vdev.

r...@opensolaris:~# od -A x -x /mnt/zpool.zones | grep 'b10c 00ba'
020000 b10c 00ba 0000 0000 0004 0000 0000 0000
020400 b10c 00ba 0000 0000 0004 0000 0000 0000
020800 b10c 00ba 0000 0000 0004 0000 0000 0000
020c00 b10c 00ba 0000 0000 0004 0000 0000 0000
021000 b10c 00ba 0000 0000 0004 0000 0000 0000
021400 b10c 00ba 0000 0000 0004 0000 0000 0000
021800 b10c 00ba 0000 0000 0004 0000 0000 0000
021c00 b10c 00ba 0000 0000 0004 0000 0000 0000
022000 b10c 00ba 0000 0000 0004 0000 0000 0000
022400 b10c 00ba 0000 0000 0004 0000 0000 0000
...

So the uberblock array begins 128 kB into the vdev and there's an uberblock 
every 1 kB.

To identify the active uberblock I used zdb.

r...@kestrel:/opt$ zdb -U -uuuv zones
Uberblock
magic = 00bab10c
version = 4
txg = 1504158 (= 0x16F39E) 
guid_sum = 10365405068077835008 = (0x8FD950FDBBD02300)
timestamp = 1229142108 UTC = Sat Dec 13 15:21:48 2008 = (0x4943385C)
rootbp = [L0 DMU objset] 400L/200P DVA[0]=0:52e3edc00:200 
DVA[1]=0:6f9c1d600:200 DVA[2]=0:16e280400:200 fletcher4 lzjb LE contiguous 
birth=1504158 fill=172 cksum=b0a5275f3:474e0ed6469:e993ed9bee4d:205661fa1d4016

I spy those hex values at the uberblock starting at offset 027800.

027800 b10c 00ba 0000 0000 0004 0000 0000 0000
027810 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
027820 385c 4943 0000 0000 0001 0000 0000 0000
027830 1f6e 0297 0000 0000 0001 0000 0000 0000
027840 e0eb 037c 0000 0000 0001 0000 0000 0000
027850 1402 00b7 0000 0000 0001 0000 0703 800b
027860 0000 0000 0000 0000 0000 0000 0000 0000
027870 0000 0000 0000 0000 f39e 0016 0000 0000
027880 00ac 0000 0000 0000 75f3 0a52 000b 0000
027890 6469 e0ed 0474 0000 ee4d ed9b e993 0000
0278a0 4016 fa1d 5661 0020 0000 0000 0000 0000
0278b0 0000 0000 0000 0000 0000 0000 0000 0000

Breaking it down

* the first 8 bytes are the magic uberblock number (b10c 00ba 0000 0000)
* the second 8 bytes are the version number (0004 0000 0000 0000)
* the third 8 bytes are the transaction group a.k.a. txg (f39e 0016 0000 0000)
* the fourth 8 bytes are the guid sum (2300 bbd0 50fd 8fd9)
* the fifth 8 bytes are the timestamp (385c 4943 0000 0000)

The remainder of the bytes are the blkptr structure and I'll ignore them.

Those values match the active uberblock exactly, so I know this is the on-disk 
location of the first active uberblock.

Scanning further I find an exact duplicate 256kB later in the device.

067800 b10c 00ba 0000 0000 0004 0000 0000 0000
067810 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
067820 385c 4943 0000 0000 0001 0000 0000 0000
067830 1f6e 0297 0000 0000 0001 0000 0000 0000
067840 e0eb 037c 0000 0000 0001 0000 0000 0000
067850 1402 00b7 0000 0000 0001 0000 0703 800b
067860 0000 0000 0000 0000 0000 0000 0000 0000
067870 0000 0000 0000 0000 f39e 0016 0000 0000
067880 00ac 0000 0000 0000 75f3 0a52 000b 0000
067890 6469 e0ed 0474 0000 ee4d ed9b e993 0000
0678a0 4016 fa1d 5661 0020 0000 0000 0000 0000
0678b0 0000 0000 0000 0000 0000 0000 0000 0000

I know ZPOOL keeps four copies of the label; two at the front and two at the 
back, each 256kB in size.

r...@opensolaris:~# ls -l /mnt/zpool.zones 
-rw-r--r-- 1 root root 42949672960 Dec 15 04:49 /mnt/zpool.zones

That's 0xA00000000 = 42949672960 bytes = 41943040 kB. If I subtract 512 kB I should see 
the third and fourth labels.

r...@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=41942528 | od -A x -x | 
grep '385c 4943'
027820 385c 4943 0000 0000 0001 0000 0000 0000
512+0 records in
512+0 records out
524288 bytes (524 kB) copied, 0.0577013 s, 9.1 MB/s
r...@opensolaris:~# 

Oddly enough I see the third uberblock at 0x27800 but the fourth uberblock at 
0x67800 is missing. Perhaps corrupted?

No matter. I now work out the exact offsets to the three valid uberblocks and 
confirm I'm looking at the right uberblocks.

r...@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=158 | od -A x -x | head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000
r...@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=414 | od -A x -x | head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000
r...@opensolaris:~# dd if=/mnt/zpool.zones bs=1k skip=41942686 | od -A x -x | 
head -3
000000 b10c 00ba 0000 0000 0004 0000 0000 0000
000010 f39e 0016 0000 0000 2300 bbd0 50fd 8fd9
000020 385c 4943 0000 0000 0001 0000 0000 0000

They all have the same timestamp. I'm looking at the correct uberblocks. Now I 
intentionally harm them.
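
Before zeroing anything, the same offsets make it easy to copy the live 
uberblocks aside so the step is reversible. A sketch only, reusing the file 
path and the 1 kB block offsets worked out above:

  # save the three matching uberblock copies before overwriting them
  for off in 158 414 41942686; do
      dd if=/mnt/zpool.zones of=/var/tmp/ub.$off.bak bs=1k skip=$off count=1
  done
  # putting one back later is just the reverse direction:
  # dd if=/var/tmp/ub.158.bak of=/mnt/zpool.zones bs=1k seek=158 count=1 conv=notrunc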

r...@opensolaris:/mnt# dd if=/dev/zero of=/mnt/zpool.zones bs=1k seek=158 
count=1 conv=notrunc
1+0 records in
1+0 records out
1024 bytes (1.0 kB) copied, 0.000315229 s, 3.2 MB/s

[zfs-discuss] Need Help Invalidating Uberblock

2008-12-14 Thread Nathan Hand
I have a ZFS pool that has been corrupted. The pool contains a single device 
which was actually a file on UFS. The machine was accidentally halted and now 
the pool is corrupt. There are (of course) no backups and I've been asked to 
recover the pool. The system panics when trying to do anything with the pool.

root@:/$ zpool status
panic[cpu1]/thread=fe8000758c80: assertion failed: dmu_read(os, 
smo->smo_object, offset, size, entry_map) == 0 (0x5 == 0x0), file: 
../../common/fs/zfs/space_map.c, line: 319
system reboots

I've booted single user, moved /etc/zfs/zpool.cache out of the way, and now 
have access to the pool from the command line. However zdb fails with a similar 
assertion.

r...@kestrel:/opt$ zdb -U -bcv zones
Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 
(0x5 == 0x0), file ../../../uts/common/fs/zfs/space_map.c, line 319
Abort (core dumped)

I've read Victor's suggestion to invalidate the active uberblock, forcing ZFS 
to use an older uberblock and thereby recovering the pool. However I don't know 
how to figure the offset to the uberblock. I have the following information 
from zdb.

r...@kestrel:/opt$ zdb -U -uuuv zones
Uberblock
magic = 00bab10c
version = 4
txg = 1504158
guid_sum = 10365405068077835008
timestamp = 1229142108 UTC = Sat Dec 13 15:21:48 2008
rootbp = [L0 DMU objset] 400L/200P DVA[0]=0:52e3edc00:200 
DVA[1]=0:6f9c1d600:200 DVA[2]=0:16e280400:200 fletcher4 lzjb LE contiguous 
birth=1504158 fill=172 cksum=b0a5275f3:474e0ed6469:e993ed9bee4d:205661fa1d4016

I've also checked the labels.

r...@kestrel:/opt$ zdb -U -lv zpool.zones 

LABEL 0

version=4
name='zones'
state=0
txg=4
pool_guid=17407806223688303760
top_guid=11404342918099082864
guid=11404342918099082864
vdev_tree
type='file'
id=0
guid=11404342918099082864
path='/opt/zpool.zones'
metaslab_array=14
metaslab_shift=28
ashift=9
asize=42944954368

LABEL 1

version=4
name='zones'
state=0
txg=4
pool_guid=17407806223688303760
top_guid=11404342918099082864
guid=11404342918099082864
vdev_tree
type='file'
id=0
guid=11404342918099082864
path='/opt/zpool.zones'
metaslab_array=14
metaslab_shift=28
ashift=9
asize=42944954368

LABEL 2

version=4
name='zones'
state=0
txg=4
pool_guid=17407806223688303760
top_guid=11404342918099082864
guid=11404342918099082864
vdev_tree
type='file'
id=0
guid=11404342918099082864
path='/opt/zpool.zones'
metaslab_array=14
metaslab_shift=28
ashift=9
asize=42944954368

LABEL 3

version=4
name='zones'
state=0
txg=4
pool_guid=17407806223688303760
top_guid=11404342918099082864
guid=11404342918099082864
vdev_tree
type='file'
id=0
guid=11404342918099082864
path='/opt/zpool.zones'
metaslab_array=14
metaslab_shift=28
ashift=9
asize=42944954368

I'm hoping somebody here can give me direction on how to figure the active 
uberblock offset, and the dd parameters I'd need to intentionally corrupt the 
uberblock and force an earlier uberblock into service.

The pool is currently on Solaris 05/08 however I'll transfer the pool to 
OpenSolaris if necessary.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is a manual zfs scrub neccessary?

2008-11-09 Thread Nathan Kroenert
The big win for me in doing a periodic scrub is that in normal 
operation, ZFS only checks data as it's read back from the disks.

If you don't periodically scrub, errors that happen over time won't be 
caught until you next read that actual data, which might be inconvenient 
if it's a long time since the initial data was written.

As I have a lot of data that is pretty much only read once or twice 
after it's originally written, I could have stuff going bad over time 
that I don't know about.

Scrubbing makes sure there is a limit on the amount of time between each 
'surprise!'.

:)

I scrub once every month or so, depending on the system.
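
If you want that on autopilot, a root crontab entry is all it takes. A sketch 
only - the pool name is illustrative, and you would pick a quiet time of month:

  # kick off a scrub at 03:00 on the 1st of every month
  0 3 1 * * /usr/sbin/zpool scrub tank

zpool scrub returns immediately and the work runs in the background, so the 
cron job itself is cheap; it's the background I/O afterwards that you need to 
schedule around.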

So, in direct answer to your question, No - You don't *need* to scrub. 
But - It's better if you do. ;)

My 2c.

Nathan.

On 10/11/08 11:38 AM, Douglas Walker wrote:
 Hi,
 
 I'm running a 3Tb RAIDZ2 array and was wondering about the zfs scrub 
 function.
 
 This server runs as my backup server and receives an rsync every night.
 
 I was wondering if I _need_ to explicitly run a zfs scrub on my zpool 
 periodically.
 
 There's a lot of info on google about running a scrub but not whether 
 it's actually needed or under what circumstances you might run one  -  
 so I thought I'd ask the list its opinion on this.
 
 If zfs does a background scrub continually anyways - is there any need 
 to manually run a scrub?
 
 I'd imagine a scrub of a 3Tb array would take quite a while (it's 7200rpm 
 SATA disks) and if I ran a scrub this would likely overlap with my 
 nightly rsyncs causing yet more I/O. Wouldn't this stress the disks more?
 
 If it is necessary - how often are people running a manual scrub? Once 
 a week? month?
 
 
 regards
 
 
 D
 

-- 


//
// Nathan Kroenert  [EMAIL PROTECTED]   //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] boot -L

2008-11-06 Thread Nathan Kroenert
A quick google shows that it's not so much about the mirror, but the BE...

http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/

Might help?

Nathan.

On  7/11/08 02:39 PM, Krzys wrote:
 What am I doing wrong? I have a sparc V210 and I am having difficulty with boot 
 -L. I was under the impression that boot -L would give me options as to which zfs 
 mirror I could boot my root disk from?
 
 Anyway, even apart from that, I am seeing some strange behavior... After 
 trying boot -L I am unable to boot my system unless I do reset-all, is that 
 normal? I have Solaris 10 U6 that I just upgraded my box to, and I wanted to 
 try all the cool things about zfs root disk mirroring and so on, but so far 
 it's been quite a strange experience with this whole thing...
 
 [22:21:25] @adas: /root  init 0
 [22:21:51] @adas: /root  stopping NetWorker daemons:
   nsr_shutdown -q
 svc.startd: The system is coming down.  Please wait.
 svc.startd: 90 system services are now being stopped.
 svc.startd: The system is down.
 syncing file systems... done
 Program terminated
 {0} ok boot -L
 
 SC Alert: Host System has Reset
 Probing system devices
 Probing memory
 Probing I/O buses
 
 Sun Fire V210, No Keyboard
 Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 OpenBoot 4.22.33, 4096 MB memory installed, Serial #64938415.
 Ethernet address 0:3:ba:de:e1:af, Host ID: 83dee1af.
 
 
 
 Rebooting with command: boot -L
 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0:a  File and args: -L
 
 Can't open bootlst
 
 Evaluating:
 The file just loaded does not appear to be executable.
 {1} ok boot disk0
 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0  
 File and args:
 ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss
 
 {1} ok boot disk1
 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL PROTECTED],0  
 File and args:
 ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss
 
 {1} ok boot
 ERROR: /[EMAIL PROTECTED],60: Last Trap: Fast Data Access MMU Miss
 
 {1} ok reset-all
 Probing system devices
 Probing memory
 Probing I/O buses
 
 Sun Fire V210, No Keyboard
 Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 OpenBoot 4.22.33, 4096 MB memory installed, Serial #64938415.
 Ethernet address 0:3:ba:de:e1:af, Host ID: 83dee1af.
 
 
 
 Boot device: /[EMAIL PROTECTED],60/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0:a  File and args:
 SunOS Release 5.10 Version Generic_137137-09 64-bit
 Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
 Use is subject to license terms.
 Hardware watchdog enabled
 Hostname: adas
 Reading ZFS config: done.
 Mounting ZFS filesystems: (3/3)
 
 adas console login: Nov  6 22:27:13 squid[361]: Squid Parent: child process 
 363 
 started
 Nov  6 22:27:18 adas ufs: NOTICE: mount: not a UFS magic number (0x0)
 starting NetWorker daemons:
   nsrexecd
 
 console login:
 
 
 Does anyone have any idea why is that happening? what am I doing wrong?
 
 Thanks for help.
 
 Regards,
 
 Chris
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 


//
// Nathan Kroenert  [EMAIL PROTECTED]   //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FYI - proposing storage pm project

2008-11-04 Thread Nathan Kroenert
Not wanting to hijack this thread, but...

I'm a simple man with simple needs. I'd like to be able to manually spin 
down my disks whenever I want to...

Anyone come up with a way to do this? ;)
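
The closest documented knob I know of is power.conf(4), which is 
threshold-based rather than truly on-demand. A sketch only - the device path 
below is hypothetical, and as Jens notes below, the behaviour varies with the 
driver:

  # /etc/power.conf addition: spin this disk down after 5 minutes of idle time
  device-thresholds   /pci@0,0/pci-ide@4/ide@0/sd@0,0   5m
  # then make the power framework re-read the file
  pmconfig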

Nathan.

Jens Elkner wrote:
 On Mon, Nov 03, 2008 at 02:54:10PM -0800, Yuan Chu wrote:
 Hi,
   
   a disk may take seconds or
   even tens of seconds to come on line if it needs to be powered up
   and spin up.
 
 Yes - I really hate this on my U40 and tried to disable PM for HDD[s]
 completely. However, haven't found a way to do this (thought
 /etc/power.conf is the right place, but either it doesn't work as
 explained or is not the right place).
 
 HDD[s] are HITACHI HDS7225S Revision: A9CA
 
 Any hints, how to switch off PM for this HDD?
 
 Regards,
 jel.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] add autocomplete feature for zpool, zfs command

2008-10-10 Thread Nathan Kroenert
Hm -

This caused me to ask the question: Who keeps the capabilities in sync?

Is there a programmatic way we can have bash (or other shells) 
interrogate zpool and zfs to find out what their capabilities are?

I'm thinking something like having bash spawn a zfs command to see what 
options are available in that current zfs / zpool version...

That way, you would never need to do anything to bash/zfs once it was 
done the first time... do it once, and as ZFS changes, the prompts 
change automatically...
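
Something along these lines would do it, as a rough proof of concept only - it 
scrapes the usage text that the installed zfs binary prints when run with no 
arguments, so the awk pattern is an assumption that may need tweaking per 
build:

  # ~/.bashrc sketch: complete 'zfs <Tab>' from the binary's own usage output
  _zfs_subcmds() {
      local cur=${COMP_WORDS[COMP_CWORD]}
      if [ "$COMP_CWORD" -eq 1 ]; then
          # 'zfs' with no arguments lists its subcommands (tab-indented) on stderr
          local subs=$(zfs 2>&1 | awk '/^\t[a-z]/ {print $1}' | sort -u)
          COMPREPLY=( $(compgen -W "$subs" -- "$cur") )
      fi
  }
  complete -F _zfs_subcmds zfs

Because the list is generated from whatever binary is on the box, it tracks 
new subcommands without anyone having to keep the completion file in sync.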

Or - is this old hat, and how we do it already? :)

Nathan.

On 10/10/08 05:06 PM, Boyd Adamson wrote:
 Alex Peng [EMAIL PROTECTED] writes:
 Is it fun to have autocomplete in zpool or zfs command?

 For instance -

 zfs cr 'Tab key'  will become zfs create
 zfs clone 'Tab key'  will show me the available snapshots
 zfs set 'Tab key'  will show me the available properties, then zfs 
 set com 'Tab key' will become zfs set compression=,  another 'Tab key' 
 here would show me on/off/lzjb/gzip/gzip-[1-9]
 ..


 Looks like a good RFE.
 
 This would be entirely under the control of your shell. The zfs and
 zpool commands have no control until after you press enter on the
 command line.
 
 Both bash and zsh have programmable completion that could be used to add
 this (and I'd like to see it for these and other solaris specific
 commands).
 
 I'm sure ksh93 has something similar.
 
 Boyd
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 


//
// Nathan Kroenert  [EMAIL PROTECTED]   //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZSF Solaris

2008-09-30 Thread Nathan Kroenert
Actually, the one that'll hurt most is ironically the most closely 
related to bad database schema design... With a zillion files in the one 
directory, if someone does an 'ls' in that directory, it'll not only 
take ages, but steal a whole heap of memory and compute power...

Provided the only things that'll be doing *anything* in that directory 
are using indexed methods, there is no real problem from a ZFS 
perspective, but if something decides to list (or worse, list and sort) 
that directory, it won't be that pleasant.

Oh - That's of course assuming you have sufficient memory in the system 
to cache all that metadata somewhere... If you don't then that's another 
zillion I/O's you need to deal with each time you list the entire directory.

an ls -1rt on a directory with about 1.2 million files with names like 
afile1202899 takes minutes to complete on my box, and we see 'ls' get to 
in excess of 700MB rss... (and that's not including the memory zfs is 
using to cache whatever it can.)

My box has the ARC limited to about 1GB, so it's obviously undersized 
for such a workload, but still gives you an indication...

I generally look to keep directories to a size that allows the utilities 
that work on and in it to perform at a reasonable rate... which for the 
most part is around the 100K files or less...
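
If the application can be taught to do it, hashing the files out over a couple 
of hundred subdirectories keeps every directory inside that comfort zone. A 
sketch only - the path and the 256-bucket scheme are just for illustration:

  # derive a bucket (0-255) from a checksum of the file name so names spread evenly
  f=afile1202899
  bucket=$(( $(echo "$f" | cksum | awk '{print $1}') % 256 ))
  mkdir -p /tank/db/$bucket && mv "$f" /tank/db/$bucket/

1.2 million files over 256 buckets is under 5000 entries per directory, which 
ls and friends handle without drama.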

Perhaps you are using larger hardware than I am for some of this stuff? :)

Nathan.

On  1/10/08 07:29 AM, Toby Thain wrote:
 On 30-Sep-08, at 7:50 AM, Ram Sharma wrote:
 
 Hi,

 can anyone please tell me what is the maximum number of files that  
 can be there in 1 folder in Solaris with the ZFS file system.
 
 I am working on an application in which I have to support 1mn  
 users. In my application I am using MySQL MyISAM, and in MyISAM  
 there are 3 files created for 1 table. I am using an application  
 architecture in which each user will have a separate table, so  
 the expected number of files in the database folder is 3mn.
 
 That sounds like a disastrous schema design. Apart from that, you're  
 going to run into problems on several levels, including O/S resources  
 (file descriptors) and filesystem scalability.
 
 --Toby
 
 I have read somewhere that there is a limit of each OS to create  
 files in a folder.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 


//
// Nathan Kroenert  [EMAIL PROTECTED]   //
// Senior Systems Engineer  Phone:  +61 3 9869 6255 //
// Global Systems Engineering   Fax:+61 3 9869 6288 //
// Level 7, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] CF to SATA adapters for boot device

2008-08-20 Thread Nathan Kroenert
I second that question, and also ask what brand folks like for 
performance and compatibility?

Ebay is killing me with vast choice and no detail... ;)

Nathan.

Al Hopper wrote:
 On Wed, Aug 20, 2008 at 12:57 PM, Neal Pollack [EMAIL PROTECTED] wrote:
 Ian Collins wrote:
 Brian Hechinger wrote:
 On Wed, Aug 20, 2008 at 05:17:45PM +1200, Ian Collins wrote:

 Has anyone here had any luck using a CF to SATA adapter?

 I've just tried an Addonics ADSACFW CF to SATA adaptor with an 8GB card 
 that I wanted to use for a boot pool and even though the BIOS reports the 
 disk, Solaris B95 (or the installer) doesn't see it.

 I tried this a while back with an IDE to CF adapter.  Real nice looking 
 one too.

 It would constantly cause OpenBSD to panic.

 I would recommend against using this, unless you get real lucky.  If you 
 want
 flash to boot from, buy one of the ones that is specifically made for it 
 (not
 CF, but industrial grade flash meant to be a HDD).  Those things work a LOT
 better.  I can look up the details of the ones my friend uses if you'd 
 like.


 I was looking to run some tests with a CF boot drive before we get an
 X4540, which has a CF slot. The installer did see the attached USB sticks...
 My team does some of the testing inside Sun for the CF boot devices.
 We've used a number of IDE attached CF adapters, such as:
 http://www.addonics.com/products/flash_memory_reader/ad44midecf.asp
 and also some random models from www.frys.com.
 We also test the CF boot feature on various Sun rack servers and blades
 that use a CF socket.

 I have not tested the SATA adapters but would not expect issues.
 I'd like to know if you find issues.


 The IDE attached devices use the legacy ATA/IDE device driver software,
 which had some bugs fixed for DMA and misc CF specific issues.
 It would be interesting to see if a SATA adapter for CF, set in bios to
 use AHCI instead of Legacy/IDE mode, would have any issues with
 the AHCI device driver software.  I've had no reason to test this yet, since
 the Sun HW models build the CF socket right onto the motherboard/bus.
 I can't find a reason to worry about hot-plug, since removing the boot
 drive while Solaris is running would be, um, somewhat interesting :-)

 True, the enterprise grade devices are higher quality and will last longer.
 But do not underestimate the current (2008) device wear leveling firmware
 that controls the CF memory usage, and hence life span.  Our in house
 destructive life span testing shows that the commercial grade CF device
 will last longer than the motherboard will.  The consumer grade devices
 
 Interesting thread - thanks to all the contributors.  I've seen, on
 several different forums, that many CF users lean towards Sandisk for
 reliability and longevity.  Does anyone else see consensus in terms of
 CF brands?
 
 that you find in the store or on mail order, may or may not be current
 generation, so your device lifespan will vary.  It should still be rather
 good for a boot device,  because Solaris does very little writing to the
 boot disk.  You can review configuration ideas to maximize the life
 of your CF device in this Solaris white paper for non-volatile memory;
 http://www.sun.com/bigadmin/features/articles/nvm_boot.jsp

 I hope this helps.

 Cheers,

 Neal Pollack

 Any further information welcome.

 Ian
 
 Regards,
 

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] help me....

2008-08-04 Thread Nathan Kroenert
It starts with Z, which makes it one of the last to be considered if 
it's listed alphabetically?

Nathan.

Rahul wrote:
 hi 
 can you give some disadvantages of the ZFS file system??
 
 plzz its urgent...
 
 help me.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS-L8i

2008-08-04 Thread Nathan Kroenert
And I can certainly vouch for that series of chipsets... I have a 
750a-sli chipset (the one below the 790) and the SATA ports (in AHCI 
mode) Just Work(tm) under nevada / opensolaris.

I'm yet to give it a while on S10, mostly as I pretty much run nevada 
everywhere... As S10 does indeed have an AHCI driver, I'd expect it 
would work just fine there too.

Oh - and the ports go like stink!*

For what it's worth, even with Nevada, you will need the newest NVidia 
Xorg drivers from nvidia's website to get the video working properly, 
and will need to add its PCI IDs to /etc/driver_aliases (And, as 
yet, I'm unable to run compiz in a stable way - Tends to hard lock up 
the machine after about 5 minutes use...), a very new hdaudio driver (I 
needed a bodgied up one from the Beijing team to make it work) and last 
I checked, the nvidia ethernet did not work properly without assigning 
it a valid ethernet address... (The driver misreads the ethernet address 
and either delivers it backwards, or byte-swaps... I don't remember 
exactly...)

Oh - And just in case you forget, most boards I have seen use IDE mode 
for the controllers by default, which reeks. Expect less than 15 MB/s if 
reading and writing at the same time if you forget to change the 
controller mode to AHCI!

For what it's worth, the board I'm using is a giga-byte..

   Manufacturer: Gigabyte Technology Co., Ltd.
   Product: M750SLI-DS4

Which also has the 6 X AHCI ports.

It might seem like it'll be a lot of hassle getting it working, but in 
the ZFS space, it works great pretty much out of the box (plus ethernet 
address change if the nvidia driver is still busted... ;)

Cheers!

Nathan.

*Going like stink means going like a hairy goat - like lightning - like 
s*it off a shovel - like a zyrtec - fast. :)

Brandon High wrote:
 On Mon, Aug 4, 2008 at 6:49 AM, Tim [EMAIL PROTECTED] wrote:
 really had the motivation or the cash to do so yet.  I've been keeping my
 eye out for a board that supports the opteron 165 and the wider lane dual
 pci-E slots that isn't stricly a *gaming* board.  I'm starting to think the
 combination doesn't exist.
 
 The AMD 790GX boards are starting to show up:
 http://www.newegg.com/Product/Product.aspx?Item=N82E16813128352
 
 Dual 8x PCIe slots, integrated video and 6 AHCI SATA ports.
 
 -B
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to delete hundreds of emtpy snapshots

2008-07-17 Thread Nathan Kroenert
In one of my prior experiments, I included the names of the snapshots I 
created in a plain text file.

I used this file, and not the zfs list output to determine which 
snapshots I was going to remove when it came time.

I don't even remember *why* I did that in the first place, but it 
certainly made things easier when it came time to clean up a whole bunch 
of stuff...

(And was not impacted by zfs list being non-snappy...)

The snapshot naming scheme meant that it was dead easy to work out which 
to remove / keep...
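
For anyone wanting to copy the scheme, it amounts to little more than this - a 
sketch, with the pool, dataset and log path all made up:

  # take a dated snapshot and remember its name outside of zfs
  snap=tank/data@auto-$(date +%Y%m%d-%H%M)
  zfs snapshot "$snap" && echo "$snap" >> /var/tmp/snaplist.txt

  # later: destroy, say, everything taken in June without touching 'zfs list'
  grep '@auto-200806' /var/tmp/snaplist.txt | while read s; do
      zfs destroy "$s"
  done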

Right now, I don't have a system (that box was killed in a dreadful xen 
experiment :) so I'll be watching this thread with renewed interest to 
see who else is doing what...

Nathan.

Bob Friesenhahn wrote:
 On Thu, 17 Jul 2008, Ben Rockwood wrote:
 
 zfs list is mighty slow on systems with a large number of objects, 
 but there is no foreseeable plan that I'm aware of to solve that 
 problem.

 Never the less, you need to do a zfs list, therefore, do it once and 
 work from that.
 
 If the snapshots were done from a script then their names are easily 
 predictable and similar logic can be used to re-create the existing 
 names.  This avoids the need to do a 'zfs list'.
 
 Bob
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-07 Thread Nathan Kroenert
Even better would be using the ZFS block checksums (assuming we are only 
summing the data, not its position or time :)...

Then we could have two files that have 90% the same blocks, and still 
get some dedup value... ;)

Nathan.

Charles Soto wrote:
 A really smart nexus for dedup is right when archiving takes place.  For
 systems like EMC Centera, dedup is basically a byproduct of checksumming.
 Two files with similar metadata that have the same hash?  They're identical.
 
 Charles
 
 
 On 7/7/08 4:25 PM, Neil Perrin [EMAIL PROTECTED] wrote:
 
 Mertol,

 Yes, dedup is certainly on our list and has been actively
 discussed recently, so there's hope and some forward progress.
 It would be interesting to see where it fits into our customers
 priorities for ZFS. We have a long laundry list of projects.
 In addition there are bug fixes and performance changes that customers
 are demanding.

 Neil.
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS write / read speed and traps for beginners

2008-06-15 Thread Nathan Kroenert
Further followup to this thread...

After being beaten sufficiently with a clue-bat, it was determined that 
the nforce 750a could do ahci mode for its SATA stuff.

I set it to ahci, and redid the devlinks etc and cranked it up as AHCI.

I'm now regularly peaking at 100MB/s, though spending most of the time 
around 70MB/s.

*much better*

The lesson here is: when in ahci mode in the bios, *don't* match that 
PCI-ID with the nv-sata driver. It's not what you want.

heh. *blush*.

Once I removed the extra nv_sata entries I had added to the 
driver_aliases in my miniroot, all was good.

On the NGE front, it turns out that solaris does not seem to like the 
ethernet address of the card. Trying to set its OWN ethernet address 
using ifconfig yielded this:
# ifconfig nge0 ether 63:d0:b:7d:1d:0
ifconfig: dlpi_set_physaddr failed nge0: DLSAP address in improper 
format or invalid
ifconfig: failed setting mac address on nge0

using

ifconfig nge0 ether 0:e:c:5b:54:45

worked just fine, and the interface now passes traffic and sees 
responses just fine. So, the workaround here is adding
   ether <a working ether address>
to the hostname.nge0

I guess I'll log a bug on that on Monday...

Awesome. Now to work on audio...

heh.

Nathan.

Nathan Kroenert wrote:
 Hey all -
 
 Just spent quite some time trying to work out why my 2 disk mirrored ZFS 
 pool was running so slow, and found an interesting answer...
 
 System: new Gigabyte M750sli-DS4, AMD 9550, 4GB memory and 2 X Seagate 
 500GB SATA-II 32mb cache disks.
 
 The SATA ports on the nforce 750asli chipset don't yet seem to be 
 supported by the nv_sata driver (I'm only running nv_89 at the mo, 
 though I'm not aware of new support going in just yet). I *can* get the 
 driver to attach, but not to see any disks. interesting, but I digress...
 
 Anyhoo - I'm stuck in IDE compatibility mode for the moment.
 
 So - using plain dd to the zfs filesystem on said disk
 
   dd if=/dev/zero of=delete.me bs=65536
 
 I could achieve only about 35-40MB/s write speed, whereas, if I dd to 
 the slice directly, I can get around 90-95MB/s
 
 I tried using whole disks versus a slice and it made no appreciable 
 difference.
 
 It turns out that when you are in IDE compatibility mode, having two 
 disks on the same 'controller' (c# in solaris) behaves just like real 
 IDE... Crap!
 
 Moving the second disk from c1 to c2 got me back to at least 50MB/s 
 with higher peaks, up to 60/70MB/s.
 
 Also of note, on the gigabyte board (and I guess other nforce 750asli 
 based chipsets) only 4 of the 6 SATA ports work when in IDE mode.
 
 Other thoughts on the Nforce 750a:
   - nge plumbs up OK and can send and 'see' packets, but does not seem 
 to know itself... In promiscuous mode, you can see returning icmp echo 
 requests, but they don't make it to the top of the stack.
 I had to use an e1000g in a PCI slot to get my networking working 
 properly...
   - Onboard Video works, including compiz, but you need to create an 
 xorg.conf and update the nvidia driver with the latest from the nvidia 
 website
 
 Seems snappy enough. With 4 cores @ 2.2Ghz (phenom 9550) it's looking 
 like it'll do what I wanted quite nicely.
 
 Later...
 
 Nathan.
 
 
 

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS write / read speed and traps for beginners

2008-06-14 Thread Nathan Kroenert
Hey all -

Just spent quite some time trying to work out why my 2 disk mirrored ZFS 
pool was running so slow, and found an interesting answer...

System: new Gigabyte M750sli-DS4, AMD 9550, 4GB memory and 2 X Seagate 
500GB SATA-II 32mb cache disks.

The SATA ports on the nforce 750asli chipset don't yet seem to be 
supported by the nv_sata driver (I'm only running nv_89 at the mo, 
though I'm not aware of new support going in just yet). I *can* get the 
driver to attach, but not to see any disks. interesting, but I digress...

Anyhoo - I'm stuck in IDE compatibility mode for the moment.

So - using plain dd to the zfs filesystem on said disk

  dd if=/dev/zero of=delete.me bs=65536

I could achieve only about 35-40MB/s write speed, whereas, if I dd to 
the slice directly, I can get around 90-95MB/s

I tried using whole disks versus a slice and it made no appreciable 
difference.

It turns out that when you are in IDE compatibility mode, having two 
disks on the same 'controller' (c# in solaris) behaves just like real 
IDE... Crap!

Moving the second disk from c1 to c2 got me back to at least 50MB/s 
with higher peaks, up to 60/70MB/s.

Also of note, on the gigabyte board (and I guess other nforce 750asli 
based chipsets) only 4 of the 6 SATA ports work when in IDE mode.

Other thoughts on the Nforce 750a:
  - nge plumbs up OK and can send and 'see' packets, but does not seem 
to know itself... In promiscuous mode, you can see returning icmp echo 
requests, but they don't make it to the top of the stack.
I had to use an e1000g in a PCI slot to get my networking working 
properly...
  - Onboard Video works, including compiz, but you need to create an 
xorg.conf and update the nvidia driver with the latest from the nvidia 
website

Seems snappy enough. With 4 cores @ 2.2Ghz (phenom 9550) it's looking 
like it'll do what I wanted quite nicely.

Later...

Nathan.



-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Systems Engineer Phone:  +61 3 9869-6255 //
// Sun Microsystems Fax:+61 3 9869-6288 //
// Level 7, 476 St. Kilda Road  Mobile: 0419 305 456//
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA controller suggestion

2008-06-07 Thread Nathan Kroenert
Tim wrote:
 
 
 
 **pci or pci-x.  Yes, you might see *SOME* loss in speed from a pci 
 interface, but let's be honest, there aren't a whole lot of users on 
 this list that have the infrastructure to use greater than 100MB/sec who 
 are asking this sort of question.  A PCI bus should have no issues 
 pushing that.
 

Hm.

If it's a system with only 1 PCI bus, there are still a few things to 
consider here.

If it's plain old 33MHz, 32-bit PCI, your 100MB/s(ish) usable bandwidth 
is actually total bandwidth. That's 50MB/s in and 50MB/s out if you are 
copying disk to disk...

I am about to update my home server for exactly the issue of saturating 
my PCI bus... It's even worse for me, as I'm mirroring, so, that works 
out to closer to 33MB/s read, 33MB/s write + 33 MB/s write to the mirror.

All in all, it blows.

I'm looking into one of the new gigabyte NVIDIA based systems with the 
750aSLI chipsets. I'm *hoping* the Solaris nv_sata drivers will work 
with the new chipset (or that we are on the way to updating them...).

My other box that's using the Nforce 570 works like a champ, and I'm 
hoping to recapture that magic. (I actually wanted to buy some more 570 
based MB's but cannot get 'em in Australia any more... :)

Cheers!

Nathan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] More USB Storage Issues

2008-06-05 Thread Nathan Kroenert
For what it's worth, I started playing with USB + flash + ZFS and was 
most unhappy for quite a while.

I was suffering with things hanging, going slow or just going away and 
breaking, and thought I was witnessing something zfs was doing as I was 
trying to do mirror recovery and all that sort of stuff.

On a hunch, I tried doing UFS and RAW instead and saw the same issues.

It's starting to look like my USB hubs. Once they are under any 
reasonable read/write load, they just make bunches of things go offline.

Yep - They are powered and plugged in.

So, at this stage, I'll be grabbing a couple of 'better' USB hubs (Mine 
are pretty much the cheapest I could buy) and see how that goes.

For gags, take ZFS out of the equation and validate that your hardware 
is actually providing a stable platform for ZFS... Mine wasn't...
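
A rough way to run that check, read-only so nothing gets destroyed - the 
device path is hypothetical, and it's worth watching iostat -xn in another 
window while it runs:

  # sequential read soak of the raw device; stalls or errors here are not ZFS's doing
  dev=/dev/rdsk/c5t0d0p0
  while :; do
      dd if="$dev" of=/dev/null bs=1024k count=2048 || break
  done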

Nathan.

Evan Geller wrote:
 So, I've been stuck in kind of an ugly pattern. I zpool create and nothing 
 goes wrong for a while, and then eventually I'll zpool status, which doesn't 
 respond to ^C or kill -9s or anything. Also, setting NOINUSE_CHECK=1 doesn't 
 appear to make a difference. I'll try and truss it next time I get a chance 
 if that helps. 
 
 Anywho, other problem is I get a huge storm of these around the same time 
 zpool hangs.
 
 Jun  4 23:17:59 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 (sd5):
 Jun  4 23:17:59 cakeoffline or reservation conflict
 Jun  4 23:18:00 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 (sd5):
 Jun  4 23:18:00 cakeoffline or reservation conflict
 Jun  4 23:18:01 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 (sd6):
 Jun  4 23:18:01 cakeoffline or reservation conflict
 Jun  4 23:18:02 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 (sd6):
 Jun  4 23:18:02 cakeoffline or reservation conflict
 Jun  4 23:18:03 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 (sd6):
 Jun  4 23:18:03 cakeoffline or reservation conflict
 Jun  4 23:18:04 cake scsi: [ID 107833 kern.warning] WARNING: /[EMAIL 
 PROTECTED],0/pci8086,[EMAIL PROTECTED],7/[EMAIL PROTECTED]/[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 (sd6):
 Jun  4 23:18:04 cakeoffline or reservation conflict
 Jun  4 23:18:04 cake zfs: [ID 664491 kern.warning] WARNING: Pool 'tank' has 
 encountered an uncorrectable I/O error. Manual intervention is required.
 
 Sorry if this isn't enough information, but if there's anything else I can 
 provide that'll help please let me know.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Technical Support Engineer   Phone:  +61 3 9869-6255 //
// Sun Services Fax:+61 3 9869-6288 //
// Level 3, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Get your SXCE on ZFS here!

2008-06-04 Thread Nathan Kroenert
format -e is your window to cache settings.
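
From memory the path through the menus looks roughly like this - treat it as a 
sketch, since the wording differs a little between releases and the disk you 
pick will obviously differ:

  format -e
  #  select the disk, then from the menu:
  #    cache
  #      write_cache
  #        display    (reports whether the volatile write cache is enabled)
  #        enable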

As for the auto-enabling, I'm not sure, as IIRC, we do different things 
based on disk technology.

eg: IDE + SATA - Always enabled
 SCSI - Disabled by default, unless you give ZFS the whole disk.

I think.

On a couple of my systems, this seems to ring true.

Not at all sure about SAS.

If I'm wrong here, hopefully someone else will provide the complete set 
of logic for determining cache enabling semantics.

:)

Nathan.

Brian Hechinger wrote:
 On Wed, Jun 04, 2008 at 09:17:05PM -0400, Ellis, Mike wrote:
 The FAQ document (
 http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/ ) has a
 jumpstart profile example:
 
 Speaking of the FAQ and mentioning the need to use slices, how does that
 affect the ability of Solaris/ZFS to automatically enable the disk's
 cache?  Does it need to be manually over-ridden (unlike giving ZFS the
 whole disk where it automatically turns the disk cache on)?
 
 Also, how can you check if the disk's cache has been enabled or not?
 
 Thanks,
 
 -brian

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Technical Support Engineer   Phone:  +61 3 9869-6255 //
// Sun Services Fax:+61 3 9869-6288 //
// Level 3, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS root finally here in SNV90

2008-06-04 Thread Nathan Kroenert
I'd expect it's the old standard.

if /var/tmp is filled, and that's part of /, then bad things happen.

there are often other places in /var that are writable by more than 
root, and always the possibility that something barfs heavily into syslog.

Since the advent of reasonably sized disks, I know many don't consider 
this an issue these days, but I'd still be inclined to keep /var (and 
especially /var/tmp) separated from /

In ZFS, this is, of course, just two filesystems in the same pool, with 
differing quotas...
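
By way of illustration only - generic pool and dataset names, not a real 
zfs-root layout:

  # two filesystems in one pool, each with its own cap
  zfs create -o quota=8g tank/var
  zfs create -o quota=1g tank/var/tmp
  zfs get quota tank/var tank/var/tmp

A full /var/tmp then stops at its own quota instead of taking / down with it.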

:)

Nathan.

Rich Teer wrote:
 On Wed, 4 Jun 2008, Bob Friesenhahn wrote:
 
 Did you actually choose to keep / and /var combined?  Is there any 
 
 THat's what I'd do...
 
 reason to do that with a ZFS root since both are sharing the same pool 
 and so there is no longer any disk space advantage?  If / and /var are 
 not combined can they have different assigned quotas without one 
 inheriting limits from the other?
 
 Why would one do that?  Just keep an eye on the root pool and all is good.
 

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Technical Support Engineer   Phone:  +61 3 9869-6255 //
// Sun Services Fax:+61 3 9869-6288 //
// Level 3, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 Thumper panic

2008-05-11 Thread Nathan Kroenert - Server ESG
Dumping to /dev/dsk/c6t0d0s1

certainly looks like a non-mirrored dump dev...

You might try a manual savecore, telling it to ignore the dump-header 
valid flag, and see what you get...

savecore -d

and perhaps try telling it to look directly at the dump device...

savecore -f device

You should also, when you get the chance, deliberately panic the box to 
make sure you can actually capture a dump...

dumpadm is your friend as far as checking where you are going to dump 
to, and if it's one side of your swap mirror, that's bad, M'Kay?
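
In command form, roughly - run the last one only when a reboot is acceptable:

  dumpadm        # shows the dump device and savecore directory in one hit
  reboot -d      # forces a crash dump on the way down, proving the whole path works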

:)

Nathan.

Jorgen Lundman wrote:
 OK, this is a pretty damn poor panic report, if I may say so - no, I've not 
 had much sleep.
 
  Solaris Express Developer Edition 9/07 snv_70b X86
 Copyright 2007 Sun Microsystems, Inc.  All Rights Reserved.
  Use is subject to license terms.
  Assembled 30 August 2007
 
 SunOS x4500-01.unix 5.11 snv_70b i86pc i386 i86pc
 
 Even though it dumped, it wrote nothing to /var/crash/. Perhaps because 
 swap is mirrored.
 
 
 
 Jorgen Lundman wrote:
 We had a panic around noon on Saturday, from which it mostly recovered 
 by itself. All ZFS NFS exports just remounted, but the UFS on zdev NFS 
 exports did not, and needed a manual umount and mount on all clients for some 
 reason.

 Is this a known bug we should consider a patch for?



 May 10 11:49:46 x4500-01.unix ufs: [ID 912200 kern.notice] quota_ufs:
 over hard
 disk limit (pid 477, uid 127409, inum 1047211, fs /export/zero1)
 May 10 11:51:26 x4500-01.unix unix: [ID 836849 kern.notice]
 May 10 11:51:26 x4500-01.unix ^Mpanic[cpu3]/thread=17b8c820:
 May 10 11:51:26 x4500-01.unix genunix: [ID 335743 kern.notice] BAD TRAP:
 type=e
 (#pf Page fault) rp=ff001f4ca220 addr=0 occurred in module
 unknown due t
 o a NULL pointer dereference
 May 10 11:51:26 x4500-01.unix unix: [ID 10 kern.notice]
 May 10 11:51:26 x4500-01.unix unix: [ID 839527 kern.notice] nfsd:
 May 10 11:51:26 x4500-01.unix unix: [ID 753105 kern.notice] #pf Page fault
 May 10 11:51:26 x4500-01.unix unix: [ID 532287 kern.notice] Bad kernel
 fault at
 addr=0x0
 May 10 11:51:26 x4500-01.unix unix: [ID 243837 kern.notice] pid=477,
 pc=0x0, sp=
 0xff001f4ca318, eflags=0x10246
 May 10 11:51:26 x4500-01.unix unix: [ID 211416 kern.notice] cr0:
 8005003bpg,wp,
 ne,et,ts,mp,pe cr4: 6f8xmme,fxsr,pge,mce,pae,pse,de
 May 10 11:51:26 x4500-01.unix unix: [ID 354241 kern.notice] cr2: 0 cr3:
 1fcbbc00
 0 cr8: c
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rdi:
 fffedef
 ea000 rsi:9 rdx:0
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rcx:
 17b
 8c820  r8:0  r9: ff054797dc48
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] rax:

  0 rbx:  97eaffc rbp: ff001f4ca350
 May 10 11:51:26 x4500-01.unix unix: [ID 592667 kern.notice] r10:

  0 r11: fffec8b93868 r12: 27991000
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] r13:
 fffed1b
 59c00 r14: fffecf8d8cc0 r15: 1000
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] fsb:

  0 gsb: fffec3d5a580  ds:   4b
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice]  es:

 4b  fs:0  gs:  1c3
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice] trp:

  e err:   10 rip:0
 May 10 11:51:27 x4500-01.unix unix: [ID 592667 kern.notice]  cs:

 30 rfl:10246 rsp: ff001f4ca318
 May 10 11:51:27 x4500-01.unix unix: [ID 266532 kern.notice]  ss:

 38
 May 10 11:51:27 x4500-01.unix unix: [ID 10 kern.notice]
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca100
 unix:die+c8 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca210
 unix:trap+135b ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca220
 unix:_cmntrap+e9 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 802836 kern.notice]
 ff001f4ca350
 0 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca3d0
 ufs:top_end_sync+cb ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca440
 ufs:ufs_fsync+1cb ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca490
 genunix:fop_fsync+51 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4ca770
 nfssrv:rfs3_create+604 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4caa70
 nfssrv:common_dispatch+444 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4caa90
 nfssrv:rfs_dispatch+2d ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4cab80
 rpcmod:svc_getreq+1c6 ()
 May 10 11:51:27 x4500-01.unix genunix: [ID 655072 kern.notice]
 ff001f4cabf0

Re: [zfs-discuss] zfs data corruption

2008-04-27 Thread Nathan Kroenert - Server ESG
Note: IANATZD (I Am Not A Team-ZFS Dude)

Speaking as a Hardware Guy, knowing that something is happening, has 
happened or is indicated to happen is a Good Thing (tm).

Begin unlikely, but possible scenario:

If, for instance, I'm getting a cluster of read errors (or, perhaps bad 
blocks), I could:
  - See it as it's happening
  - See the block number for each error
  - already know the rate at which the errors are happening
  - Be able to determine that it's not good, and it's time to replace 
the disk.
  - You get the picture...

And based on this information, I could feel confident that I have the 
right information at hand to be able to determine that it is or is not 
time to replace this disk.

Of course, that assumes:
  - I know anything about disks
  - I know anything about the error messages
  - I have some sort of logging tool that recognises the errors (and 
does not just throw out the 'retryable ones', as most I have seen are 
configured to do)
  - I care
  - The folks watching the logs in the enterprise management tool care
  - My storage even bothers to report the errors

Certainly, for some organisations, all of the above are exactly how it 
works, and it works well for them.

Looking at the ZFS/FMA approach, it certainly is somewhat different.

The (very) rough concept is that FMA gets pretty much all errors 
reported to it. It logs them, in a persistent store, which is always 
available to view. It also makes diagnoses on the errors, based on the 
rules that exist for that particular style of error. Once enough (or the 
right type of) errors happen, it'll then make a Fault Diagnosis for that 
component, and log a message, loud and proud into the syslog. It may 
also take other actions, like, retire a page of memory, offline a CPU, 
panic the box, etc.

So - That's the rough overview.

It's worth noting up front that we can *observe* every event that has 
happened. Using fmdump and fmstat we can immediately see if anything 
interesting has been happening, or we can wait for a Fault Diagnosis, in 
which case, we can just watch /var/adm/messages.

I also *believe* (though am not certain - Perhaps someone else on the 
list might be?) it would be possible to have each *event* (so - the 
individual events that lead to a Fault Diagnosis) generate a message if 
it was required, though I have never taken the time to do that one...

There are many advantages to this approach - It does not rely on 
logfiles, offsets into logfiles, counters of previously processed 
messages and all of the other doom and gloom that comes with scraping 
logfiles. It's something you can simply ask: Any issues, chief? The 
answer is there in a flash.

You will also be less likely to have the messages rolled out of the logs 
before you get to them (another classic...).

And - You get some great details from fmdump showing you what's really 
going on, and it's something that's really easy to parse to look for 
patterns.

All of this said, I understand that if you feel things are being 'hidden' 
from you until they're *actually* busted, it can seem like some of your 
forward vision is being obscured 'in the name of a quiet logfile'. I felt much 
the same way for a period of time. (Though, I live more in the CPU / 
Memory camp...)

But - Once I realised what I could do with fmstat and fmdump, I was not 
the slightest bit unhappy (Actually, that's not quite true... Even once 
I knew what they could do, it still took me a while to work out the 
options I cared about for fmdump / fmstat), but I now trust FMA to look 
after my CPU / Memory issues better than I would in real life. I can 
still get what I need when I want to, and the data is actually more 
accessible and interesting. I just needed to know where to go looking.
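
For reference, something like the following is usually enough to answer the 
"Any issues, chief?" question - the 24h window is just an example:

  fmstat           # per-module counters: a quick "is anything happening at all?"
  fmdump           # the list of diagnosed faults
  fmdump -e -t 24h # raw error events from the last day
  fmdump -eV       # the same again, with full detail per event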

All this being said, I was not actually aware that many of our disk / 
target drivers were actually FMA'd up yet. heh - Shows what I know.

Does any of this make you feel any better (or worse)?

Nathan.

Mark A. Carlson wrote:
 fmd(1M) can log faults to syslogd that are already diagnosed. Why
 would you want the random spew as well?
 
 -- mark
 
 Carson Gaspar wrote:
 [EMAIL PROTECTED] wrote:

   
 It's not safe to jump to this conclusion.  Disk drivers that support FMA
 won't log error messages to /var/adm/messages.  As more support for I/O
 FMA shows up, you won't see random spew in the messages file any more.
 

 mode=large financial institution paying support customer
 That is a Very Bad Idea. Please convey this to whoever thinks that 
 they're helping by not sysloging I/O errors. If this shows up in 
 Solaris 11, we will Not Be Amused. Lack of off-box error logging will 
 directly cause loss of revenue.
 /mode

   
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss

Re: [zfs-discuss] zfs data corruption

2008-04-23 Thread Nathan Kroenert
I'm just taking a stab here, so could be completely wrong, but IIRC, 
even if you disable checksum, it still checksums the metadata...

So, it could be metadata checksum errors.

Others on the list might have some funky zdb thingies you could to see 
what it actually is...

Note: typed pre caffeine... :)

Nathan

Vic Engle wrote:
 I'm hoping someone can help me understand a zfs data corruption symptom. We 
 have a zpool with checksum turned off. Zpool status shows that data 
 corruption occurred. The application using the pool at the time reported a 
 read error and zpool status (see below) shows 2 read errors on a device. 
 The thing that is confusing to me is how ZFS determines that data corruption 
 exists when reading data from a pool with checksum turned off.
 
 Also, I'm wondering about the persistent errors in the output below. Since no 
 specific file or directory is mentioned does this indicate pool metadata is 
 corrupt?
 
 Thanks for any help interpreting the output...
 
 
 # zpool status -xv
   pool: zpool1
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: none requested
 config:
 
 NAME STATE READ WRITE CKSUM
 zpool1   ONLINE   2 0 0
   c4t60A9800043346859444A476B2D48446Fd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D484352d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D484236d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D482D6Cd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483951d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483836d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D48366Bd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483551d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483435d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D48326Bd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483150d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D483035d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D47796Ad0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D477850d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D477734d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D47756Ad0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D47744Fd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D477333d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D477169d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D47704Ed0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D476F33d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D476D68d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D476C4Ed0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D476B32d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D476968d0  ONLINE   0 0 0
   c4t60A98000433468656834476B2D453974d0  ONLINE   0 0 0
   c4t60A98000433468656834476B2D454142d0  ONLINE   0 0 0
   c4t60A98000433468656834476B2D454255d0  ONLINE   0 0 0
   c4t60A98000433468656834476B2D45436Dd0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D487346d0  ONLINE   2 0 0
   c4t60A9800043346859444A476B2D487175d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D48705Ad0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486F45d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486D74d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486C5Ad0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486B44d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486974d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486859d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486744d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486573d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486459d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486343d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D486173d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D482F58d0  ONLINE   0 0 0
   c4t60A9800043346859444A476B2D485A43d0  ONLINE   0 0 0

Re: [zfs-discuss] Zfs send takes 3 days for 1TB?

2008-04-09 Thread Nathan Kroenert
Indeed -

If it was 100Mb/s ethernet, 1TB would take near enough 24 hours just to 
push that much data...
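
(Back of the envelope: 100Mb/s is roughly 12MB/s of payload, and 1TB at 
12MB/s is about 83,000 seconds - a shade under 24 hours - so the wire 
alone accounts for it.)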

Would be great to see some details of the setup and where the bottleneck 
was. I'd be surprised if ZFS has anything to do with the transfer rate...

But an interesting read anyways. :)

Nathan.



Nicolas Williams wrote:
 On Wed, Apr 09, 2008 at 11:38:03PM -0400, Jignesh K. Shah wrote:
 Can zfs send utilize multiple-streams of data transmission (or some sort 
 of multipleness)?

 Interesting read for background
 http://people.planetpostgresql.org/xzilla/index.php?/archives/338-guid.html

 Note: zfs send takes 3 days for 1TB to another system
 
 Huh?  That article doesn't describe how they were moving the zfs send
 stream, whether the limit was the network, ZFS or disk I/O.  In fact,
 it's bereft of numbers.  It even says that the transfer time is not
 actually three days but upwards of 24 hours.
 
 Nico

-- 
//
// Nathan Kroenert  [EMAIL PROTECTED] //
// Technical Support Engineer   Phone:  +61 3 9869-6255 //
// Sun Services Fax:+61 3 9869-6288 //
// Level 3, 476 St. Kilda Road  //
// Melbourne 3004   VictoriaAustralia   //
//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Sun X2100?

2008-03-19 Thread Nathan Kroenert
Did you do anything specific with the drive caches?

How is your ZFS performance?

Nathan. :)

Rich Teer wrote:
 On Wed, 19 Mar 2008, Terence Ng wrote:
 
 I am new to Solaris. I have Sun X2100 with 2 x 80G harddisks (run as
 email server, run tomcat, jboss and postgresql) and want to run as
 mirror to secure the data. Since ZFS cannot be used as a root file
 system, does that mean there is no way I can benefit from using ZFS?
 
 Nope, in fact I have set up an X2100 pretty much exactly as you want.
 I set up 5 partitions: /, swap, space for live upgrade, a small partition
 for the SVM metadbs, and the rest of the disk.  This last one is used
 as that machine's zdev for its ZFS pool.
 
 So, root and swap mirrored using SVM, and everything else on a mirrored
 ZFS pool.
 
 HTH,
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs 32bits

2008-03-06 Thread Nathan Kroenert
Paul -

Don't substitute redundancy for backup...

If your data is important to you, for the love of steak, make sure you 
have a backup that would not be destroyed by, say, a lightning strike, 
fire or stray 747.

For what it's worth, I'm also using ZFS on 32 bit and am yet to 
experience any sort of issues.

An external 500GB disk + external USB enclosure runs for what - $150?

That's what I use anyways. :)

Nathan.

Paul Kraus wrote:
 On Thu, Mar 6, 2008 at 10:22 AM, Brian D. Horn [EMAIL PROTECTED] wrote:
 
 ZFS is not 32-bit safe.  There are a number of places in the ZFS code where
  it is assumed that a 64-bit data object is being read atomically (or set
  atomically).  It simply isn't true and can lead to weird bugs.
 
 This is disturbing, especially as I have not seen this
 documented anywhere. I have a dual P-III 550 Intel system with 1 GB of
 RAM (Intel L440GX+ motherboard). I am running Solaris 10U4 and am
 using ZFS (mirrors and stripes only, no RAIDz). While this is 'only' a
 home server, I still cannot afford to lose over 500 GB of data. If ZFS
 isn't supported under 32 bit systems then I need to start migrating to
 UFS/SLVM as soon as I can. I specifically went with 10U4 so that I
 would have a stable, supportable environment.
 
 Under what conditions are the 32 bit / 64 bit problems likely
 to occur ? I have been running this system for 6 months (a migration
 from OpenSuSE 10.1) without any issues. The NFS server performance is
 at least an order of magnitude better than the SuSE server was.
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Nathan Kroenert
Bob Friesenhahn wrote:
 On Tue, 4 Mar 2008, Nathan Kroenert wrote:

 It does seem that some of us are getting a little caught up in disks 
 and their magnificence in what they write to the platter and read 
 back, and overlooking the potential value of a simple (though 
 potentially computationally expensive) circus trick, which might, just 
 might, make your broken 1TB archive useful again...
 
 The circus trick can be handled via a user-contributed utility.  In 
 fact, people can compete with their various repair utilities.  There are 
 only 1048576 1-bit permutations to try, and then the various two-bit 
 permutations can be tried.

That does not sound 'easy', and I consider that ZFS should be... :) and 
IMO it's something that should really be built in, not attacked with an 
addon.

I had (as did Jeff in his initial response) considered that we only need 
to actually try to flip 128KB worth of bits once... That many flips 
means that we are in a way 'processing' some 128GB in the worst case when 
re-generating checksums.  Internal to a CPU, depending on Cache 
Aliasing, competing workloads, threadedness, etc, this could be 
dramatically variable... something I guess the ZFS team would want to 
keep out of the 'standard' filesystem operation... hm. :\
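
For the curious, here's a toy sketch of the brute-force loop we're talking 
about. The checksum below is a stand-in I've made up purely for the 
illustration - a real repair tool would recompute the block's actual 
fletcher/SHA-256 checksum and compare it with the one stored in the block 
pointer:

/*
 * Toy illustration of the single-bit "circus trick": flip each of the
 * 1,048,576 bits in a 128KB block in turn and see whether the checksum
 * comes good again.  Note how long even this cut-down demo takes to
 * run - that's the computational cost being discussed above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLKSIZE (128 * 1024)            /* one 128KB ZFS block           */
#define NBITS   (BLKSIZE * 8L)          /* 1,048,576 candidate bit flips */

/* Stand-in checksum; a real tool would use fletcher4 or SHA-256. */
static unsigned long
toy_checksum(const unsigned char *buf)
{
        unsigned long sum = 0;
        size_t i;

        for (i = 0; i < BLKSIZE; i++)
                sum = sum * 31 + buf[i];
        return (sum);
}

/* Return the bit whose flip restores 'want', or -1 if no single flip does. */
static long
repair_single_bit(unsigned char *blk, unsigned long want)
{
        long bit;

        for (bit = 0; bit < NBITS; bit++) {
                blk[bit / 8] ^= (1 << (bit % 8));       /* flip the bit */
                if (toy_checksum(blk) == want)
                        return (bit);                   /* repaired     */
                blk[bit / 8] ^= (1 << (bit % 8));       /* flip it back */
        }
        return (-1);    /* no single-bit flip helps; try two-bit flips next */
}

int
main(void)
{
        unsigned char *blk = malloc(BLKSIZE);
        unsigned long want;

        if (blk == NULL)
                return (1);
        memset(blk, 0xA5, BLKSIZE);
        want = toy_checksum(blk);       /* checksum of the "good" block  */
        blk[12345] ^= 0x10;             /* simulate a single flipped bit */

        printf("repaired at bit %ld\n", repair_single_bit(blk, want));
        free(blk);
        return (0);
}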

 I don't think it's a good idea for us to assume that it's OK to 'leave 
 out' potential goodness for the masses that want to use ZFS in 
 non-enterprise environments like laptops / home PC's, or use commodity 
 components in conjunction with the Big Stuff... (Like white box PC's 
 connected to an EMC or HDS box... )
 
 It seems that goodness for the masses has not been left out.  The 
 forthcoming ability to request duplicate ZFS blocks is very good news 
 indeed.  We are entering an age where the entry level SATA disk is 1TB 
 and users have more space than they know what to do with.  A little 
 replication gives these users something useful to do with their new disk 
 while avoiding the need for unreliable circus tricks to recover data.  
 ZFS goes far beyond MS-DOS's recover command (which should have been 
 called destroy).

I never have enough space on my laptop... I guess I'm a freak.

But - I am sure that we are *both* right for some subsets of ZFS users, 
and that the more choice we have built into the filesystem, the better.

Thanks again for the comments!

Nathan.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-03 Thread Nathan Kroenert
Hey, Bob

My perspective on Big reasons for it *to* be integrated would be:
  - It's tested - By the folks charged with making ZFS good
  - It's kept in sync with the differing Zpool versions
  - It's documented
  - When the system *is* patched, any changes the patch brings are 
synced with the recovery mechanism
  - Being integrated, it has options that can be persistently set if 
required
  - It's there when you actually need it
  - It could be integrated with Solaris FMA to take some funky actions 
based on the nature of the failure, including cool messages telling you 
what you need to run to attempt a repair etc
  - It's integrated (recursive, self fulfilling benefit... ;)

As for the separate utility for different failure modes, I agree, 
*development* of these might be faster if everyone chases their own pet 
failure mode and contributes it, but I still think getting them 
integrated either as optional actions on error, or as part of zdb or 
other would be far better than having to go looking for the utility and 
'give it a whirl'.

But - I'm sure that's a personal preference, and I'm sure that there are 
those that would love the opportunity to roll their own.

OK - I'm going to shutup now. I think I have done this to death, and I 
don't want to end up in everyone's kill filter.

Cheers!

Nathan.



Bob Friesenhahn wrote:
 On Tue, 4 Mar 2008, Nathan Kroenert wrote:
 The circus trick can be handled via a user-contributed utility.  In fact, 
 people can compete with their various repair utilities.  There are only 
 1048576 1-bit permutations to try, and then the various two-bit permutations 
 can be tried.
 That does not sound 'easy', and I consider that ZFS should be... :) and IMO 
 it's something that should really be built in, not attacked with an addon.
 
 There are several reasons why this sort of thing should not be in ZFS 
 itself.  A big reason is that if it is in ZFS itself, it can only be 
 updated via an OS patch or upgrade, along with a required reboot.  If 
 it is in a utility, it can be downloaded and used as the user sees fit 
 without any additional disruption to the system.  While some errors 
 are random, others follow well defined patterns, so it may be that one 
 utility is better than another or that user-provided options can help 
 achieve success faster.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vs. Novell NSS

2008-02-28 Thread Nathan Kroenert - Server ESG
Hm -

Based on this detail from the page:

Change lever for switching between Rotation
   + Hammering , Neutral and Hammering only

I'd hope it could still hammer... Though I'd suspect the size of nails 
it would hammer would be somewhat limited... ;)

Nathan.

Boyd Adamson wrote:
 Richard Elling [EMAIL PROTECTED] writes:
 Tim wrote:
 The greatest hammer in the world will be inferior to a drill when 
 driving a screw :)

 The greatest hammer in the world is a rotary hammer, and it
 works quite well for driving screws or digging through degenerate
 granite ;-)  Need a better analogy.
 Here's what I use (quite often) on the ranch:
 http://www.hitachi-koki.com/powertools/products/hammer/dh40mr/dh40mr.html
 
 Hasn't the greatest hammer in the world lost the ability to drive
 nails? 
 
 I'll have to start belting them in with the handle of a screwdriver...
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-26 Thread Nathan Kroenert
Are you indicating that the filesystem knows or should know what an 
application is doing??

It seems to me that to achieve what you are suggesting, that's exactly 
what it would take.

Or, you are assuming that there are no co-dependent files in 
applications that are out there...

Whichever the case, I'm confused...!

Unless you are perhaps suggesting an IOCTL that an application 
could call to indicate I'm done for this round, please snapshot or 
something to that effect. Even then, I'm still confused as to how I 
would do anything much useful with this over and above, say, 1 minute 
snapshots.

Nathan.


Uwe Dippel wrote:
 atomic view?
 
 Your post was on the gory details on how ZFS writes. Atomic View here is, 
 that 'save' of a file is an 'atomic' operation: at one moment in time you 
 click 'save', and some other moment in time it is done. It means indivisible, 
 and from the perspective of the user this is how it ought to look.
 
 The rub is this: how do you know when a file edit/modify has completed?
 
 Not to me, I'm sorry, this is task of the engineer, the implementer. (See 
 'atomic', as above.)
 It would be a shame if a file system never knew if the operation was 
 completed.
 
 If an application has many files then an edit/modify may include
 updates and/or removals of more than one file. So once again: how do
 you know when an edit/modify has completed?
 
 So an 'edit' fires off a few child processes to do this and that and then you 
 forget about them, hoping for them to do a proper job. 
 Oh, this gives me confidence ;)
 
 No, seriously, let's look at some applications:
 
 A. User works in Office (Star-Office, sure!) and clicks 'Save' for a current 
 work before making major modifications. So the last state of the document 
 (odt) is being stored. Currently we can set some Backup option to be done 
 regularly. Meaning that the backup could have happened at the very wrong 
 moment; while saving the state on each user request 'Save' is much better.
 
 B. A bunch of e-mails are read from the Inbox and stored locally (think 
 Maildir). The user sees the sender, doesn't know her, and deletes all of 
 them. Of course, the deletion process will have fired up the CDP-engine 
 ('event') and retire the files instead of deletion. So when the sender calls, 
 and the user learns that he made a big mistake, he can roll back to before 
 the deletion (event).
 
 C. (Sticking with /home/) I agree with you, that the rather continuous 
 changes of the dot-files and dot-directories in the users HOME that serve 
 JDS, and many more, do eventually not necessarily allow to reconstitute a 
 valid state of the settings at all and any moment. Still, chances are high, 
 that they will. In the worst case, the unlucky user can roll back to when he 
 last took a break, if only for grabbing another coffee, because it took a 
 minute, the writes (see above) will hopefully have completed. oh, s***, 
 already messed up the settings? Then try to roll back to lunch break. Works? 
 Okay! But when you roll back to lunch break, where is the stuff done in 
 between? The backup solution means that they are lost. The event-driven (CDP) 
 not: you can roll over all the states of files or directories between lunch 
 break and recover the third latest version of your tendering document (see 
 above), within the settings of the desktop that were valid this morning.
 
 Maybe SUN can't do this, but wait for Apple, and OSX10-dot-something (using 
 ZFS as default!) will know how to do it. (And they probably also know, when 
 their 'writes' are done.)
 
 Uwe
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-26 Thread Nathan Kroenert
It occurred to me that we are likely missing the point here because Uwe 
is thinking of this as a One User on a System sort of perspective, 
whereas most of the rest of us are thinking of it from a 'Solaris' 
perspective, where we are typically expecting the system to be running 
many applications / DB's / users all at the same time.

In Uwe's use cases thus far, it seems that he is interested in only the 
simple single user style applications, if I'm not mistaken, so he's not 
considering the consequences of what it *really* means to have CDP in 
the way he wishes.

Uwe - am I close here?

Nathan.
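
As a footnote to Nico's 'modify the apps' suggestion in the quoted text 
below - an application can get most of the way there today by cutting the 
snapshot itself with the existing zfs(1M) command once its own files are 
consistent. A rough sketch only; the dataset name and file path are made up:

/*
 * Sketch of an app signalling "my files are consistent now" by taking
 * a ZFS snapshot itself, via the existing zfs(1M) command rather than
 * a new syscall.  The dataset "tank/home" and the file path are made
 * up for the example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int
main(void)
{
        int fd = open("/tank/home/document.odt", O_WRONLY | O_CREAT, 0644);
        char cmd[128];

        if (fd < 0) {
                perror("open");
                return (1);
        }
        /* ... the application writes out its state here ... */
        (void) write(fd, "saved state\n", 12);

        (void) fsync(fd);       /* data is on stable storage ...           */
        (void) close(fd);

        /* ... so a snapshot cut now is consistent from the app's view.    */
        (void) snprintf(cmd, sizeof (cmd),
            "zfs snapshot tank/home@consistent-%ld", (long)time(NULL));
        return (system(cmd) == 0 ? 0 : 1);
}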


Nicolas Williams wrote:
 On Tue, Feb 26, 2008 at 06:34:04PM -0800, Uwe Dippel wrote:
 The rub is this: how do you know when a file edit/modify has completed?
 Not to me, I'm sorry, this is task of the engineer, the implementer.
 (See 'atomic', as above.) It would be a shame if a file system never
 knew if the operation was completed.
 
 The filesystem knows if a filesystem operation completed.  It can't know
 application state.  You keep missing that.
 
 If an application has many files then an edit/modify may include
 updates and/or removals of more than one file. So once again: how do
 you know when an edit/modify has completed?
 So an 'edit' fires off a few child processes to do this and that and
 then you forget about them, hoping for them to do a proper job.  Oh,
 this gives me confidence ;)
 
 You'd rather the filesystem guess application state than have the app
 tell it?  Weird.  Your other alternative -- saving a history of every
 write -- doesn't work because you can't tell what point in time is safe
 to restore to.
 
 No, seriously, let's look at some applications:

 A. User works in Office (Star-Office, sure!) and clicks 'Save' for a
 current work before making major modifications. So the last state of
 the document (odt) is being stored. Currently we can set some Backup
 option to be done regularly. Meaning that the backup could have
 happened at the very wrong moment; while saving the state on each user
 request 'Save' is much better.
 
 So modify the office suite to call a new syscall that says I'm
 internally consistent in all these files and boom, the filesystem can
 now take a useful snapshot.
 
 B. A bunch of e-mails are read from the Inbox and stored locally
 (think Maildir). The user sees the sender, doesn't know her, and
 deletes all of them. Of course, the deletion process will have fired
 up the CDP-engine ('event') and retire the files instead of deletion.
 So when the sender calls, and the user learns that he made a big
 mistake, he can roll back to before the deletion (event).
 
 Now think of an application like this but which uses, say, SQLite (e.g.,
 Firefox 3.x, Thunderbird, ...).  The app might never close the database
 file, just fsync() once in a while.  The DB might have multiple files
 (in the SQLite case that might be multiple DBs ATTACHed into one
 database connection).  Also, an fsync of a SQLite journal file is not
 as useful to CDP as an fsync() of a SQLite DB proper.  Now add any of a
 large number of databases and apps to the mix and forget it -- the
 heuristics become impossible or mostly useless.
 
 C. (Sticking with /home/) I agree with you, that the rather continuous
 changes of the dot-files and dot-directories in the users HOME that
 serve JDS, and many more, do eventually not necessarily allow to
 reconstitute a valid state of the settings at all and any moment.
 Still, chances are high, that they will. In the worst case, the
 
 Chances?  So what, we tell the user try restoring to this snapshot,
 login again and if stuff doesn't work, then try another snapshot?  What
 if the user discovers too late that the selected snapshot was
 inconsistent and by then they've made other changes?
 
 unlucky user can roll back to when he last took a break, if only for
 grabbing another coffee, because it took a minute, the writes (see
 
 That sounds mighty painful.
 
 I'd rather modify some high-profile apps to tell the filesystem that
 their state is consistent, so take a snapshot.
 
 Maybe SUN can't do this, but wait for Apple, and OSX10-dot-something
 (using ZFS as default!) will know how to do it. (And they probably
 also know, when their 'writes' are done.)
 
 I'm giving you the best answer -- modify the apps -- and you reject it.
 Given how many important apps Apple controls it wouldn't surprise me if
 they did what I suggest.  We should do it too.  But one step at a time.
 We need to setup a project, gather requirements, design a solution, ...
 And since the solution will almost certainly entail modifications to
 apps where heuristics won't help, well, I think this would be a project
 with fairly wide scope, which means it likely won't go fast.
 
 Nico
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cause for data corruption?

2008-02-25 Thread Nathan Kroenert
My guess is that you have some defective hardware in the system that's 
causing bit flips in the checksum or the data payload.

I'd suggest running some sort of system diagnostics for a few hours to 
see if you can locate the bad piece of hardware.

My suspicion would be your memory or CPU, but that's just a wild guess, 
based on the number of errors you have and the number of devices it's 
spread over.

Could it be that you have been corrupting data for some time and not 
known it?

Oh - And I'd also look around based on your disk controller and ensure 
that there are no newer patches for it, just in case it's one for which 
there was a known problem. (which was worked around in the driver)

I *think* there was an issue with at least one or two...

Cheers!

Nathan.

Sandro wrote:
 hi folks
 
 I've been running my fileserver at home with linux for a couple of years and 
 last week I finally reinstalled it with solaris 10 u4.
 
 I borrowed a bunch of disks from a friend, copied over all the files, 
 reinstalled my fileserver and copied the data back.
 
 Everything went fine, but after a few days now, quite a lot of files got 
 corrupted.
 here's the output:
 
  # zpool status data
   pool: data
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
 config:
 
 NAMESTATE READ WRITE CKSUM
 dataONLINE   0 0 5.52K
   raidz1ONLINE   0 0 5.52K
 c0t0d0  ONLINE   0 0 10.72
 c0t1d0  ONLINE   0 0 4.59K
 c0t2d0  ONLINE   0 0 5.18K
 c0t3d0  ONLINE   0 0 9.10K
 c1t0d0  ONLINE   0 0 7.64K
 c1t1d0  ONLINE   0 0 3.75K
 c1t2d0  ONLINE   0 0 4.39K
 c1t3d0  ONLINE   0 0 6.04K
 
 errors: 388 data errors, use '-v' for a list
 
 Last night I found out about this, it told me there were errors in like 50 
 files.
 So I scrubbed the whole pool and it found a lot more corrupted files.
 
 The temporary system which I used to hold the data while I'm installing 
 solaris on my fileserver is running nv build 80 and no errors on there.
 
 What could be the cause of these errors??
 I don't see any hw errors on my disks..
 
  # iostat -En | grep -i error
 c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c4d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c0t0d0   Soft Errors: 574 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c1t0d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c0t1d0   Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c0t2d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c0t3d0   Soft Errors: 549 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c1t1d0   Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c1t2d0   Soft Errors: 14 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 c1t3d0   Soft Errors: 548 Hard Errors: 0 Transport Errors: 0
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 
 although a lot of soft errors.
 Linux said that one disk had gone bad, but I figured the sata cable was 
 somehow broken, so I replaced that before installing solaris. And solaris 
 didn't and doesn't see any actual hw errors on the disks, does it?
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-25 Thread Nathan Kroenert
And would drive storage requirements through the roof!!

I like it!

;)

Nathan.

Jonathan Loran wrote:
 
 David Magda wrote:
 On Feb 24, 2008, at 01:49, Jonathan Loran wrote:

 In some circles, CDP is big business. It would be a great ZFS offering.
 ZFS doesn't have it built-in, but AVS made be an option in some cases:

 http://opensolaris.org/os/project/avs/
 
 Point in time copy (as AVS offers) is not the same thing as CDP.  When 
 you snapshot data as in point in time copies, you predict the future, 
 knowing the time slice at which your data will be needed.  Continuous 
 data protection is based on the premise that you don't have a clue ahead 
 of time which point in time you want to recover to.  Essentially, for 
 CDP, you need to save every storage block that has ever been written, so 
 you can put them back in place if you so desire. 
 
 Anyone else on the list think it is worthwhile adding CDP to the ZFS 
 list of capabilities?  It causes space management issues, but it's an 
 interesting, useful idea.
 
 Jon
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes

2008-02-15 Thread Nathan Kroenert
What about new blocks written to an existing file?

Perhaps we could make that clearer in the manpage too...

hm.


Mattias Pantzare wrote:
  
   If you created them after, then no worries, but if I understand
   correctly, if the *file* was created with 128K recordsize, then it'll
   keep that forever...


 Files have nothing to do with it.  The recordsize is a file system
  parameter.  It gets a little more complicated because the recordsize
  is actually the maximum recordsize, not the minimum.
 
 Please read the manpage:
 
  Changing the file system's recordsize only affects files
  created afterward; existing files are unaffected.
 
 Nothing is rewritten in the file system when you change recordsize so
  it stays the same for existing files.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes

2008-02-15 Thread Nathan Kroenert
Hey, Richard -

I'm confused now.

My understanding was that any files created after the recordsize was set 
would use that as the new maximum recordsize, but files already created 
would continue to use the old recordsize.

Though I'm now a little hazy on what will happen when the existing 
files are updated as well...

hm.

Cheers!

Nathan.

Richard Elling wrote:
 Nathan Kroenert wrote:
 And something I was told only recently - It makes a difference if you 
 created the file *before* you set the recordsize property.
 
 Actually, it has always been true for RAID-0, RAID-5, RAID-6.
 If your I/O strides over two sets then you end up doing more I/O,
 perhaps twice as much.
 

 If you created them after, then no worries, but if I understand 
 correctly, if the *file* was created with 128K recordsize, then it'll 
 keep that forever...
 
 Files have nothing to do with it.  The recordsize is a file system
 parameter.  It gets a little more complicated because the recordsize
 is actually the maximum recordsize, not the minimum.
 -- richard
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes

2008-02-14 Thread Nathan Kroenert
And something I was told only recently - It makes a difference if you 
created the file *before* you set the recordsize property.

If you created them after, then no worries, but if I understand 
correctly, if the *file* was created with 128K recordsize, then it'll 
keep that forever...

Assuming I understand correctly.

Hopefully someone else on the list will be able to confirm.
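
As an aside, the alignment point Richard makes in the quoted text below is 
easy to get right in a test harness. A rough sketch (the file path and 
sizes are made up), where every offset is a multiple of 8KB so a write 
never straddles two records:

/*
 * Illustration only: 8KB random writes aligned on 8KB boundaries, to a
 * pre-created file on a filesystem that had recordsize=8k set *before*
 * the file was written.  File path and sizes are made up.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define RECSIZE  8192                    /* matches recordsize=8k        */
#define FILESIZE (100L * 1024 * 1024)    /* 100MB pre-created test file  */
#define NRECS    (FILESIZE / RECSIZE)

int
main(void)
{
        int fd = open("/tank/db/testfile", O_WRONLY);
        char buf[RECSIZE];
        int i;

        if (fd < 0) {
                perror("open");
                return (1);
        }
        memset(buf, 0, sizeof (buf));
        for (i = 0; i < 10000; i++) {
                /* offset is always a multiple of 8KB, so each write maps
                 * onto exactly one record - no read-modify-write */
                off_t off = (off_t)(rand() % NRECS) * RECSIZE;

                if (pwrite(fd, buf, RECSIZE, off) != RECSIZE) {
                        perror("pwrite");
                        break;
                }
        }
        (void) close(fd);
        return (0);
}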

Cheers!

Nathan.

Richard Elling wrote:
 Anton B. Rang wrote:
 Create a pool [ ... ]
 Write a 100GB file to the filesystem [ ... ]
 Run I/O against that file, doing 100% random writes with an 8K block size.
 
 Did you set the record size of the filesystem to 8K?

 If not, each 8K write will first read 128K, then write 128K.
   
 
 Also check to see that your 8kByte random writes are aligned on 8kByte
 boundaries, otherwise you'll be doing a read-modify-write.
  -- richard
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.

2008-02-06 Thread Nathan Kroenert
Hey all -

I'm working on an interesting issue where I'm seeing ZFS being quite 
cranky about writing O_SYNC blocks.

Bottom line is that I have a small test case that does essentially this:

open file for writing  -- O_SYNC
loop {
    write() 8KB of random data
    print time taken to write data
}
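
A cut-down C version of that, for the record (the file path is made up and 
error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BS 8192

int
main(void)
{
        /* every write() on this fd is synchronous thanks to O_SYNC */
        int fd = open("/testpool/fs/syncfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
        char buf[BS];
        struct timeval t0, t1;
        int i, j;

        if (fd < 0) {
                perror("open");
                return (1);
        }
        srand((unsigned)getpid());
        for (i = 0; ; i++) {                    /* ctrl-C to stop     */
                for (j = 0; j < BS; j++)
                        buf[j] = rand() & 0xff; /* 8KB of random data */

                gettimeofday(&t0, NULL);
                if (write(fd, buf, BS) != BS) {
                        perror("write");
                        break;
                }
                gettimeofday(&t1, NULL);

                printf("write %d: %.3f ms\n", i,
                    (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_usec - t0.tv_usec) / 1000.0);
        }
        (void) close(fd);
        return (0);
}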

It's taking anywhere up to 80 seconds per 8KB block. When the 'problem' 
is not in evidence (and it's not always happening), I can do around 
1200 O_SYNC writes per second...

It seems to be waiting here virtually all of the time:

  0t11021::pid2proc | ::print proc_t p_tlist|::findstack -v
stack pointer for thread 30171352960: 2a118052df1
[ 02a118052df1 cv_wait+0x38() ]
   02a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570,
   60005db6640)
   02a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 
2000,
   2a1180539d8)
   02a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 
300a27a9e48, 0,
   7b7462d0)
   02a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1)
   02a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80,
   ff3a00c0, ff3a0100)
   02a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0,
   ff3a0100)

And this also evident in a dtrace of it, following the write in...

...
  28- zil_commit
  28  - cv_wait
  28- thread_lock
  28- thread_lock
  28- cv_block
  28  - ts_sleep
  28  - ts_sleep
  28  - new_mstate
  28- cpu_update_pct
  28  - cpu_grow
  28- cpu_decay
  28  - exp_x
  28  - exp_x
  28- cpu_decay
  28  - cpu_grow
  28- cpu_update_pct
  28  - new_mstate
  28  - disp_lock_enter_high
  28  - disp_lock_enter_high
  28  - disp_lock_exit_high
  28  - disp_lock_exit_high
  28- cv_block
  28- sleepq_insert
  28- sleepq_insert
  28- disp_lock_exit_nopreempt
  28- disp_lock_exit_nopreempt
  28- swtch
  28  - disp
  28- disp_lock_enter
  28- disp_lock_enter
  28- disp_lock_exit
  28- disp_lock_exit
  28- disp_getwork
  28- disp_getwork
  28- restore_mstate
  28- restore_mstate
  28  - disp
  28  - pg_cmt_load
  28  - pg_cmt_load
  28- swtch
  28- resume
  28  - savectx
  28- schedctl_save
  28- schedctl_save
  28  - savectx
...

At this point, it waits for up to 80 seconds.

I'm also seeing zil_commit() being called around 7-15 times per second.

For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a 
pinch of difference. :)

For what it's worth, this is a T2000, running Oracle, connected to an 
HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle 
filesystems, and I'm only seeing the issue on the ZFS filesystems that 
have the active oracle tables on them.

The O_SYNC test case is just trying to help me understand what's 
happening. The *real* problem is that oracle is running like rubbish 
when it's trying to roll forward archive logs from another server. It's 
an almost 100% write workload. At the moment, it cannot even keep up 
with the other server's log creation rate, and it's barely doing 
anything. (The other box is quite different, so not really valid for 
direct comparison at this point).

6513020 looked interesting for a while, but I already have 120011-14 and 
127111-03 installed.

I'm looking into the cache flush settings of the 9990 array to see if 
it's that killing me, but I'm also looking for any other ideas on what 
might be hurting me.

I also have set
zfs:zfs_nocacheflush = 1
in /etc/system

The Oracle Logs are on a separate Zpool and I'm not seeing the issue on 
those filesystems.

The lockstats I have run are not yet all that interesting. If anyone has 
ideas on specific incantations I should use or some specific D or 
anything else, I'd be most appreciative.

Cheers!

Nathan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun 5220 as a ZFS Server?

2008-02-05 Thread Nathan Kroenert
For what it's worth, I configured a T5220 this week with a 6 disk, three 
mirror zpool. (three top level mirror vdevs...).

Used only internal disks...

When pushing to disk, I was seeing bursts of 70 odd MB/s per spindle, 
with all 6 spindles making the 70MB/s, so 350MB/s ish.

Read performance was about the same for large files. (did not do 
anything with small files, though I expect that with the 2.5 SAS disks, 
it should be pretty good...).

I was not seeing a consistent 70MB/s per spindle, which I put down to 
the fact that I was only using a single thread to generate the writes. 
(A single thread of an N2 is only so fast... Just think of what you 
could do with 64 of them ;)

I'll be interested to see what the others have to say. :)

Hope this helps.

Nathan.





Michael Stalnaker wrote:
 We’re looking at building out several ZFS servers, and are considering an 
 x86 platform vs a Sun 5520 as the base platform. Any comments from the 
 floor on comparative performance as a ZFS server? We’d be using the LSI 
 3801 controllers in either case.
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 30 seond hang, ls command....

2008-01-30 Thread Nathan Kroenert
Any chance the disks are being powered down, and you are waiting for 
them to power back up?

Nathan. :)

Neal Pollack wrote:
 I'm running Nevada build 81 on x86 on an Ultra 40.
 # uname -a
 SunOS zbit 5.11 snv_81 i86pc i386 i86pc
 Memory size: 8191 Megabytes
 
 I started with this zfs pool many dozens of builds ago, approx a year ago.
 I do live upgrade and zfs upgrade every few builds.
 
 When I have not accessed the zfs file systems for a long time,
 if I cd there and do an ls command, nothing happens for approx 30 seconds.
 
 Any clues how I would find out what is wrong?
 
 --
 
 # zpool status -v
   pool: tank
  state: ONLINE
  scrub: none requested
 config:
 
 NAMESTATE READ WRITE CKSUM
 tankONLINE   0 0 0
   raidz2ONLINE   0 0 0
 c2d0ONLINE   0 0 0
 c3d0ONLINE   0 0 0
 c4d0ONLINE   0 0 0
 c5d0ONLINE   0 0 0
 c6d0ONLINE   0 0 0
 c7d0ONLINE   0 0 0
 c8d0ONLINE   0 0 0
 
 errors: No known data errors
 
 
 # zfs list
 NAME   USED  AVAIL  REFER  MOUNTPOINT
 tank   172G  2.04T  52.3K  /tank
 tank/arc   172G  2.04T   172G  /zfs/arc
 
 # zpool list
 NAME   SIZE   USED  AVAILCAP  HEALTH  ALTROOT
 tank  3.16T   242G  2.92T 7%  ONLINE  -
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hardware for zfs home storage

2008-01-14 Thread Nathan Kroenert
I see a business opportunity for someone...

Backups for the masses... of Unix / VMS and other OS/s out there.

any takers? :)

Nathan.

Jonathan Loran wrote:
 
 
 eric kustarz wrote:
 On Jan 14, 2008, at 11:08 AM, Tim Cook wrote:

   
 www.mozy.com appears to have unlimited backups for 4.95 a month.   
 Hard to beat that.  And they're owned by EMC now so you know they  
 aren't going anywhere anytime soon.
 

 I just signed on and am trying Mozy out.  Note, its $5 per computer  
 and its *not* archival.  If you delete something on your computer,  
 then 30 days later it is not going to be backed up anymore.

 eric
   
 
 And they don't support Solaris or Linux, so that means I would have to 
 transfer everything indirectly from my Mac.  Or worse yet, run windoz in 
 a VM.  Hardly practical.  Why is it we always have to be second class 
 citizens!  Power to the (*x) people!
 
 Jon
 
 -- 
 
 
 - _/ _/  /   - Jonathan Loran -   -
 -/  /   /IT Manager   -
 -  _  /   _  / / Space Sciences Laboratory, UC Berkeley
 -/  / /  (510) 643-5146 [EMAIL PROTECTED]
 - __/__/__/   AST:7731^29u18e3
  
 
 
 
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing partition/label info

2007-12-17 Thread Nathan Kroenert
format -e

then from there, re-label using SMI label, versus EFI.

Cheers

Al Slater wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Hi,
 
 What is the quickest way of clearing the label information on a disk
 that has been previously used in a zpool?
 
 regards
 
 - --
 Al Slater
 
 Technical Director
 SCL
 
 Phone : +44 (0)1273 07
 Fax   : +44 (0)1273 01
 email : [EMAIL PROTECTED]
 
 Stanton Consultancy Ltd
 Pavilion House, 6-7 Old Steine, Brighton, East Sussex, BN1 1EJ
 Registered in England Company number: 1957652 VAT number: GB 760 2433 55
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.7 (MingW32)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
 iD8DBQFHZoluz4fTOFL/EDYRAnr5AJ4ie+xFNCi6gA5HLZ8IqI1wHItEEwCgj0ru
 EwSc9B16io3kBz2wS0LGoEQ=
 =eaZc
 -END PGP SIGNATURE-
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-13 Thread Nathan Kroenert
This question triggered some silly questions in my mind:

Lots of folks are determined that the whole COW to different locations 
is a Bad Thing(tm), and in some cases, I guess it might actually be...

What if ZFS had a pool / filesystem property that caused zfs to do a 
journaled, but non-COW update so the data's relative location for 
databases is always the same?

Or - What if it did a double update: One to a staged area, and another 
immediately after that to the 'old' data blocks. Still always have 
on-disk consistency etc, at a cost of double the I/O's...

Of course, both of these would require non-sparse file creation for the 
DB etc, but would it be plausible?

For very read intensive and position sensitive applications, I guess 
this sort of capability might make a difference?

Just some stabs in the dark...

Cheers!

Nathan.


Louwtjie Burger wrote:
 Hi
 
 After a clean database load a database would (should?) look like this,
 if a random stab at the data is taken...
 
 [8KB-m][8KB-n][8KB-o][8KB-p]...
 
 The data should be fairly (100%) sequential in layout ... after some
 days though that same spot (using ZFS) would probably look like:
 
 [8KB-m][   ][8KB-o][   ]
 
 Is this pseudo logical-physical view correct (if blocks n and p was
 updated and with COW relocated somewhere else)?
 
 Could a utility be constructed to show the level of fragmentation ?
 (50% in above example)
 
 IF the above theory is flawed... how would fragmentation look/be
 observed/calculated under ZFS with large Oracle tablespaces?
 
 Does it even matter what the fragmentation is from a performance 
 perspective?
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS very slow under xVM

2007-11-01 Thread Nathan Kroenert
I observed something like this a while ago, but assumed it was something 
I did. (It usually is... ;)

Tell me - If you watch with an iostat -x 1, do you see bursts of I/O 
then periods of nothing, or just a slow stream of data?

I was seeing intermittent stoppages in I/O, with bursts of data on 
occasion...

Maybe it's not just me... Unfortunately, I'm still running old nv and 
xen bits, so I can't speak to the 'current' situation...

Cheers.

Nathan.

Martin wrote:
 Hello
 
 I've got Solaris Express Community Edition build 75 (75a) installed on an 
 Asus P5K-E/WiFI-AP (ip35/ICH9R based) board.  CPU=Q6700, RAM=8Gb, 
 disk=Samsung HD501LJ and (older) Maxtor 6H500F0.
 
 When the O/S is running on bare metal, ie no xVM/Xen hypervisor, then 
 everything is fine.
 
 When it's booted up running xVM and the hypervisor, then unlike plain disk 
 I/O, and unlike svm volumes, zfs is around 20 time slower.
 
 Specifically, with either a plain ufs on a raw/block disk device, or ufs on a 
 svn meta device, a command such as dd if=/dev/zero of=2g.5ish.dat bs=16k 
 count=15 takes less than a minute, with an I/O rate of around 30-50Mb/s.
 
 Similary, when running on bare metal, output to a zfs volume, as reported by 
 zpool iostat, shows a similar high output rate. (also takes less than a 
 minute to complete).
 
 But, when running under xVM and a hypervisor, although the ufs rates are 
 still good, the zfs rate drops after around 500Mb.
 
 For instance, if a window is left running zpool iostat 1 1000, then after the 
 dd command above has been run, there are about 7 lines showing a rate of 
 70Mbs, then the rate drops to around 2.5Mb/s until the entire file is 
 written.  Since the dd command initially completes and returns control back 
 to the shell in around 5 seconds, the 2 gig of data is cached and is being 
 written out.  It's similar with either the Samsung or Maxtor disks (though 
 the Samsung are slightly faster).
 
 Although previous releases running on bare metal (with xVM/Xen) have been 
 fine, the same problem exists with the earlier b66-0624-xen drop of Open 
 Solaris
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] characterizing I/O on a per zvol basis.

2007-10-17 Thread Nathan Kroenert
Hey all -

Time for my silly question of the day, and before I bust out vi and 
dtrace...

Is there a simple, existing way I can observe the read / write / IOPS on 
a per-zvol basis?

If not, is there interest in having one?

Cheers!

Nathan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?

2007-10-04 Thread Nathan Kroenert
I think it's a little more sinister than that...

I'm only just trying to import the pool. Not even yet doing any I/O to it...

Perhaps it's the same cause, I don't know...

But I'm certainly not convinced that I'd be happy with a 25K, for 
example, panicing just because I tried to import a dud pool...

I'm ok(ish) with the panic on a failed write to a non-redundant storage. 
I expect it by now...

Cheers!

Nathan.

Victor Engle wrote:
 Wouldn't this be the known feature where a write error to zfs forces a panic?
 
 Vic
 
 
 
 On 10/4/07, Ben Rockwood [EMAIL PROTECTED] wrote:
 Dick Davies wrote:
 On 04/10/2007, Nathan Kroenert [EMAIL PROTECTED] wrote:


 Client A
   - import pool make couple-o-changes

 Client B
   - import pool -f  (heh)


 Oct  4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80:
 Oct  4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion
 failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5
 == 0x0)
 , file: ../../common/fs/zfs/space_map.c, line: 339
 Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160
 genunix:assfail3+b9 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200
 zfs:space_map_load+2ef ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240
 zfs:metaslab_activate+66 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300
 zfs:metaslab_group_alloc+24e ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0
 zfs:metaslab_alloc_dva+192 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470
 zfs:metaslab_alloc+82 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0
 zfs:zio_dva_allocate+68 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0
 zfs:zio_next_stage+b3 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510
 zfs:zio_checksum_generate+6e ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530
 zfs:zio_next_stage+b3 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0
 zfs:zio_write_compress+239 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0
 zfs:zio_next_stage+b3 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610
 zfs:zio_wait_for_children+5d ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630
 zfs:zio_wait_children_ready+20 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650
 zfs:zio_next_stage_async+bb ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670
 zfs:zio_nowait+11 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960
 zfs:dbuf_sync_leaf+1ac ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0
 zfs:dbuf_sync_list+51 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10
 zfs:dnode_sync+23b ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50
 zfs:dmu_objset_sync_dnodes+55 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0
 zfs:dmu_objset_sync+13d ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40
 zfs:dsl_pool_sync+199 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0
 zfs:spa_sync+1c5 ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60
 zfs:txg_sync_thread+19a ()
 Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70
 unix:thread_start+8 ()
 Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]


 Is this a known issue, already fixed in a later build, or should I bug it?

 It shouldn't panic the machine, no. I'd raise a bug.


 After spending a little time playing with iscsi, I have to say it's
 almost inevitable that someone is going to do this by accident and panic
 a big box for what I see as no good reason. (though I'm happy to be
 educated... ;)

 You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously
 access the same LUN by accident. You'd have the same problem with
 Fibre Channel SANs.

 I ran into similar problems when replicating via AVS.

 benr.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?

2007-10-04 Thread Nathan Kroenert
Erik -

Thanks for that, but I know the pool is corrupted - That was kind of the 
point of the exercise.

The bug (at least to me) is ZFS panicing Solaris just trying to import 
the dud pool.

But, maybe I'm missing your point?

Nathan.




eric kustarz wrote:

 Client A
   - import pool make couple-o-changes

 Client B
   - import pool -f  (heh)

 Client A + B - With both mounting the same pool, touched a couple of
 files, and removed a couple of files from each client

 Client A + B - zpool export

 Client A - Attempted import and dropped the panic.

 
 ZFS is not a clustered file system.  It cannot handle multiple readers 
 (or multiple writers).  By importing the pool on multiple machines, you 
 have corrupted the pool.
 
 You purposely did that by adding the '-f' option to 'zpool import'.  
 Without the '-f' option ZFS would have told you that its already 
 imported on another machine (A).
 
 There is no bug here (besides admin error :)  ).
 
 eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?

2007-10-04 Thread Nathan Kroenert
Awesome.

Thanks, Eric. :)

This type of feature / fix is quite important to a number of the guys in 
our local OSUG. In particular, they are adamant that they cannot use 
ZFS in production until it stops panicing the whole box for isolated 
filesystem / zpool failures.

This will be a big step. :)

Cheers.

Nathan.

Eric Schrock wrote:
 On Fri, Oct 05, 2007 at 08:20:13AM +1000, Nathan Kroenert wrote:
 Erik -

 Thanks for that, but I know the pool is corrupted - That was kind of the 
 point of the exercise.

 The bug (at least to me) is ZFS panicing Solaris just trying to import 
 the dud pool.

 But, maybe I'm missing your point?

 Nathan.
 
 This is a variation on the read error while writing problem.  It is a
 known issue and a generic solution (to handle any kind of non-replicated
 writes failing) is in the works (see PSARC 2007/567).
 
 - Eric
 
 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

