Re: [zfs-discuss] Solaris vs FreeBSD question

2011-05-20 Thread Frank Van Damme
Op 20-05-11 01:17, Chris Forgeron schreef:
 I ended up switching back to FreeBSD after using Solaris for some time 
 because I was getting tired of weird pool corruptions and the like.

Did you ever manage to recover the data you blogged about on Sunday,
February 6, 2011?

-- 
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] Faulted Pool Question

2011-05-20 Thread Paul Kraus
On Fri, May 20, 2011 at 12:53 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On May 19, 2011, at 2:09 PM, Paul Kraus p...@kraus-haus.org wrote:

    Is there a way (other than zpool online) to kick ZFS into
 rescanning the LUNs ?

 zpool clear poolname

I am unclear on when 'zpool clear' is the right command versus 'zpool
online'. I have not gotten consistent information from Oracle. Could Richard
(or someone else) please summarize here? Thanks.
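
For reference, my current understanding of the two forms (from the man page,
so please correct me if the semantics differ in practice):

# zpool online poolname device      - bring back a device that was offlined or removed
# zpool clear poolname              - clear error counters and fault state pool-wide
# zpool clear poolname device       - clear errors/faults on a single device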

snip

    If I had realized the entire 3511 array had gone away and that we
 would be restarting it, I would NOT have attempted to replace the
 faulted LUN and we would probably be OK.

 yes

Yeah, hindsight and all that. But at the moment I hit return on
the 'zpool replace', only one of the three trays on the 3511 was
faulted ... sigh.

snip

 P.S. The other zpools on the box are still up and running. The ones
 that had devices on the faulted 3511 are degraded but online; the ones
 that did not have devices on the faulted 3511 are OK. Because of these
 other zpools we can't really reboot the box or pull the FC
 connections.

 Reboot isn't needed, this isn't a PeeCee :-)

Oracle support recommended a reboot (which did clear the ZFS
issue). I was not at the office to try to get a better solution out of
Oracle.

Now this morning, the original tray in the 3511 that failed is
offline again, but this time it is not the earlier bug we have run into,
but a genuine failure of more than one drive in a RAID set. So now I
am replacing the faulted LUNs via 'zpool replace' (and have asked that
no one reboot any 3511s until I am done :-)

-- 
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players


Re: [zfs-discuss] Solaris vs FreeBSD question

2011-05-20 Thread Chris Forgeron

-----Original Message-----
From: Frank Van Damme
Sent: Friday, May 20, 2011 6:25 AM

Op 20-05-11 01:17, Chris Forgeron schreef:
 I ended up switching back to FreeBSD after using Solaris for some time 
 because I was getting tired of weird pool corruptions and the like.

Did you ever manage to recover the data you blogged about on Sunday, February 
6, 2011?

Oh yes, I didn't follow up on that. I'll have to do that now... here's the recap.

Yes, I did get most of it back, thanks to a lot of effort from George Wilson
(great guy, and I'm very indebted to him). However, any data that was in play
at the time of the fault was irreversibly damaged and couldn't be restored. Any
data that wasn't active at the time of the crash was perfectly fine; it just
needed to be copied out of the pool into a new pool. George had to mount my
pool for me, as it was beyond my non-ZFS-programmer skills to mount. Unfortunately
Solaris would dump after about 24 hours, requiring a second mounting by George.
It was also slower than cold molasses to copy anything in its faulted state.
If I was getting 1 MB/sec, I was lucky. You can imagine that creates an issue
when you're trying to evacuate a few TB of data through a pipe that slow.

After it dumped again, I didn't bother George for a third remounting (or I
tried only half-heartedly - he had already put a lot of time into this, and
we all have our day jobs), and abandoned the data that was still stranded on
the faulted pool. I copied my most-wanted data first, so what I abandoned was a
personal collection of movies that I could always re-rip.


I was still experimenting with ZFS at the time, so I wasn't using snapshots for
backup, just conventional image backups of the VMs that were running.
Snapshots would have had a good chance of protecting my data from the fault
that I ran into.


I originally blamed my Areca 1880 card, as I was working with Areca tech
support on a more stable driver for Solaris and was on the 3rd revision of a
driver with them. However, in the end it wasn't the Areca, as I was very
familiar with its tricks - the Areca would hang (about once every day or two),
but it wouldn't take out the pool. After removing the Areca and going with
just LSI 2008-based controllers, I had one final fault about 3 weeks later
that corrupted another pool (luckily it was just a backup pool). At that point
the swearing in the server room reached a peak, I booted back into FreeBSD, and
haven't looked back. Originally, when I used the Areca controller with FreeBSD,
I didn't have any problems for about 2 months.

I've had only small FreeBSD issues since then, and nothing else has changed in
my hardware. So the only claim I can make is that in my environment, on my
hardware, I've had better stability with FreeBSD.

One of the speed slow-downs with FreeBSD in my comparison tests was the
O_SYNC method that ESX uses to mount an NFS store. I edited the FreeBSD NFS
server source to always do an async write, regardless of the O_SYNC from the
client, and that perked FreeBSD up a lot for speed, making it fairly close to
what I was getting on Solaris (a ZFS-level way to get the same trade-off is
sketched below). FreeBSD now uses an NFS 4.1 server by default as of the last
month, and I'm just starting stability tests with a new FreeBSD-9 build to see
if I can run newer code. I'll do speed tests again, and will probably make the
same hack to the 4.1 NFS code to force async writes. I'll post to my blog and
the FreeBSD lists when that happens, as it's out of scope for this list.
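
Side note: my understanding is that on ZFS builds new enough to have the
per-dataset sync property, you can get roughly the same trade-off without
patching the NFS server. I haven't tested this on the versions I'm running,
and the dataset name below is just an example:

# zfs set sync=disabled tank/esx-datastore
# zfs get sync tank/esx-datastore

Like my source hack, this acknowledges writes before they are on stable
storage, so a crash can lose the last few seconds of client writes.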

I do like Solaris - after some initial discomfort with the different way
things are done, I do see the overall design and idea, and I now have a
wish list of features I'd like to see ported to FreeBSD. I think I'll have a
Solaris-based box set up again for testing. We'll see what time allows.


[zfs-discuss] Is Dedup processing parallelized?

2011-05-20 Thread Jim Klimov

Hi all,

On my oi_148a system I'm now in the process of evacuating
data from my dcpool (an iSCSI device with a ZFS pool inside),
which is hosted in my physical pool on hard disks (6-disk
raidz2). The dcpool was configured to dedup all data inside
it, and the volume pool/dcpool was compressed, so as to separate
the two processes. I decided to scrap this experiment, and
now I'm copying my data back by reading files from dcpool
and writing them into compressed+deduped datasets in pool.

I often see two interesting conditions in this setup:

1) The process is rather slow (I think due to the dedup involved -
   even though, by my calculations, the whole DDT can fit in my
   8 GB of RAM). However, the kernel processing time often peaks
   out at close to 50%, and there is often quite a bit of idle
   time. I have a dual-core box, so it seems likely that some
   kernel task is using no more than one core.

   Does anyone know whether the DDT tree walk, the search for
   available block ranges in the metaslabs, or whatever other
   lengthy cycles there may be, are done in a sequential
   (single-threaded) fashion?

   Below is my current DDT sizing. I still do not know which
   value to trust as the DDT entry size in RAM - the one
   returned by mdb or the one by zdb (and what exactly are the
   in-core and on-disk values? I've asked before but got
   no replies...)

# zdb -D -e 1601233584937321596
DDT-sha256-zap-ditto: 68 entries, size 1807 on disk, 240 in core
DDT-sha256-zap-duplicate: 1970815 entries, size 1134 on disk, 183 in core
DDT-sha256-zap-unique: 4376290 entries, size 1158 on disk, 187 in core

dedup = 1.38, compress = 1.07, copies = 1.01, dedup * compress / copies 
= 1.46


# zdb -D -e dcpool
DDT-sha256-zap-ditto: 388 entries, size 380 on disk, 200 in core
DDT-sha256-zap-duplicate: 5421787 entries, size 311 on disk, 176 in core
DDT-sha256-zap-unique: 16841361 entries, size 284 on disk, 145 in core

dedup = 1.34, compress = 1.00, copies = 1.00, dedup * compress / copies 
= 1.34


# echo ::sizeof ddt_entry_t | mdb -k
sizeof (ddt_entry_t) = 0x178

   Since I'm writing to pool (queried by its GUID above),
   my box's performance primarily depends on its DDT - I guess.
   In the worst case that is about 6.35 million entries times
   376 bytes = roughly 2.4 GB, which is well below my computer's
   8 GB of RAM (and fits the ARC metadata report below).

   However, the dcpool's current DDT is clearly big: about
   22.3 million entries * 376 bytes = roughly 8.4 GB.
   (The arithmetic for both is spelled out just below.)
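
   For completeness, here is the arithmetic, using the 0x178
   (= 376 byte) in-core entry size reported by mdb above -
   whether that is the right number to multiply by is exactly
   my open question:

# echo '(68 + 1970815 + 4376290) * 376' | bc
2386537048        <- DDT of pool, about 2.4 GB
# echo '(388 + 5421787 + 16841361) * 376' | bc
8371089536        <- DDT of dcpool, about 8.4 GB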

2) As seen below, the ARC including metadata currently takes up 3.7 GB.
   According to prstat, all of the global-zone processes use 180 MB.
   ZFS is the only filesystem on this box.
   So the second question is: who uses the other 4 GB of system RAM?

   This picture occurs consistently during every uptime, as long
   as I use the pool for reading and/or writing extensively, and it
   seems to be some sort of kernel buffering or workspace memory
   (cached metaslab allocation tables, maybe?). It is not part of
   the ARC - but it is even bigger.

   What is it? Can it be controlled (so as not to decrease performance
   when the ARC and/or DDT need more RAM) or at least queried?
   One query attempt is shown just below.
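
   The only query I know of for the kernel-side breakdown is the
   mdb memstat dcmd (on my build it splits usage into Kernel,
   ZFS File Data, Anon, Exec and libs, Page cache and Free) -
   maybe someone can help me interpret its output:

# echo ::memstat | mdb -k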

# ./tuning/arc_summary.pl | egrep -v 'mdb|set zfs:' | head -18 | grep : 
; echo ::arc | mdb -k | grep meta_

 Physical RAM:  8183 MB
 Free Memory :  993 MB
 LotsFree:  127 MB
 Current Size: 3705 MB (arcsize)
 Target Size (Adaptive):   3705 MB (c)
 Min Size (Hard Limit):3072 MB (zfs_arc_min)
 Max Size (Hard Limit):6656 MB (zfs_arc_max)
 Most Recently Used Cache Size:   90%    3342 MB (p)
 Most Frequently Used Cache Size:  9%     362 MB (c-p)
arc_meta_used =  2617 MB
arc_meta_limit=  6144 MB
arc_meta_max  =  4787 MB

Thanks for any insights,
//Jim Klimov



Re: [zfs-discuss] Monitoring disk seeks

2011-05-20 Thread Sašo Kiselkov
On 05/19/2011 07:47 PM, Richard Elling wrote:
 On May 19, 2011, at 5:35 AM, Sašo Kiselkov wrote:
 
 Hi all,

 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
 
 In general, this is hard to see from the OS.  In Solaris, the default I/O
 flowing through sd gets sorted based on LBA before being sent to the
 disk. If the disk gets more than 1 concurrent I/O request (10 is the default
 for Solaris-based ZFS) then the disk can resort or otherwise try to optimize
 the media accesses.
 
 As others have mentioned, iopattern is useful for looking at sequential 
 patterns. I've made some adjustments for the version at
 http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
 
 You can see low-level SCSI activity using scsi.d, but I usually uplevel that
 to using iosnoop -Dast which shows each I/O and its response time.
 Note that the I/Os can complete out-of-order on many devices. The only 
 device I know that is so fast and elegant that it always completes in-order 
 is the DDRdrive.
 
 For detailed analysis of iosnoop data, you will appreciate a real statistics
 package. I use JMP, but others have good luck with R.
  -- richard

Thank you, the iopattern script seems to be quite close to what I
wanted. The percentage split between random and sequential I/O is pretty
much what I needed to know.
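
For the archives, here is a rough D sketch in the same spirit, in case it
helps someone who can't grab the toolkit. It is untested and only
approximates what iopattern does properly: a delta of 0 between consecutive
I/Os on a device means the access was sequential, anything else was a seek.

# dtrace -qn '
io:::start
/ last[args[1]->dev_statname] != 0 /
{
        @delta[args[1]->dev_statname] =
            quantize((int64_t)(args[0]->b_blkno - last[args[1]->dev_statname]));
}
io:::start
{
        last[args[1]->dev_statname] = args[0]->b_blkno + args[0]->b_bcount / 512;
}'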

Regards,
--
Saso


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 New problem:
 
 I'm following all the advice I summarized into the OP of this thread, and
 testing on a test system.  (A laptop).  And it's just not working.  I am
 jumping into the dedup performance abyss far, far earlier than
predicted...

(resending this message, because it doesn't seem to have been delivered the
first time.  If this is a repeat, please ignore.)

Now I'm repeating all these tests on a system that more closely resembles a
server. This is a workstation with a 6-core processor, 16 GB of RAM, and a
single 1 TB hard disk.

In the default configuration, arc_meta_limit is 3837 MB. As I increase
the number of unique blocks in the data pool, it is perfectly clear that
performance jumps off a cliff when arc_meta_used starts to reach that level,
which is approximately 880,000 to 1,030,000 unique blocks. FWIW, this means
that without evil tuning, a 16 GB server is only sufficient to run dedup on
approximately 33 GB to 125 GB of unique data without severe performance
degradation - I'm calling severe degradation anything that's an order of
magnitude slower or worse. (That's 40 KB average block size * 880,000 unique
blocks, and 128 KB average block size * 1,030,000 unique blocks.)

So clearly this needs to be addressed, if dedup is going to be super-awesome
moving forward.

But I didn't quit there.

So then I tweak arc_meta_limit, setting it to 7680 MB (the commands are
below), and repeat the test. This time the edge of the cliff is not so
clearly defined, somewhere around 1,480,000 to 1,620,000 blocks. But here is
the problem: arc_meta_used never even comes close to 7680 MB. At all times,
I still have at least 2 GB of unused free memory.
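
In case anyone wants to reproduce this, I'm raising the limit roughly as
follows - the /etc/system tunable name is my assumption for this build, so
double-check it before copying. 0x1e0000000 bytes is exactly 7680 MB.

Live, on the running kernel (lost at reboot):
# echo 'arc_meta_limit/Z 0x1e0000000' | mdb -kw

Persistent, in /etc/system:
set zfs:zfs_arc_meta_limit=0x1e0000000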

I have 16 GB of physical memory, but at all times I have at least 2 GB free.
My arcstats:c_max is 15 GB, but my ARC size never exceeds 8.7 GB.
My arc_meta_limit is 7680 MB, but my arc_meta_used never exceeds 3647 MB.

So what's the holdup?

All of the above is, of course, just a summary.  If you want complete
overwhelming details, here they are:
http://dl.dropbox.com/u/543241/dedup%20tests/readme.txt

http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c
http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh
http://dl.dropbox.com/u/543241/dedup%20tests/parse.py
http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh

http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass-parsed.xlsx

http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass-parsed.xlsx




Re: [zfs-discuss] Is Dedup processing parallelized?

2011-05-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 1) The process is rather slow (I think due to dedup involved -
 even though, by my calculations, the whole DDT can fit in
 my 8Gb RAM), 

Please see:
http://opensolaris.org/jive/thread.jspa?messageID=516567

In particular:
 New problem:
 I'm following all the advice I summarized into the OP of this thread, and
[In other words, complete DDT fits in ram]
 testing on a test system. (A laptop). And it's just not working. I am
 jumping into the dedup performance abyss far, far earlier than
predicted...

and:  I have another post, which doesn't seem to have found its way to this
list.  So I just resent it.  Here's a snippet:

 This is a workstation with 6 core processor, 16G ram, and a single 1TB 
 hard disk.
 In the default configuration, arc_meta_limit is 3837MB.  And as I increase

 the number of unique blocks in the data pool, it is perfectly clear that 
 performance jumps off a cliff when arc_meta_used starts to reach that 
 level, which is approx 880,000 to 1,030,000 unique blocks.  FWIW, this 
 means, without evil tuning, a 16G server is only sufficient to run dedup 
 on approx 33GB to 125GB unique data without severe performance 
 degradation


 # zdb -D -e 1601233584937321596
 DDT-sha256-zap-ditto: 68 entries, size 1807 on disk, 240 in core
 DDT-sha256-zap-duplicate: 1970815 entries, size 1134 on disk, 183 in core
 DDT-sha256-zap-unique: 4376290 entries, size 1158 on disk, 187 in core
 
 dedup = 1.38, compress = 1.07, copies = 1.01, dedup * compress / copies
 = 1.46
 
 # zdb -D -e dcpool
 DDT-sha256-zap-ditto: 388 entries, size 380 on disk, 200 in core
 DDT-sha256-zap-duplicate: 5421787 entries, size 311 on disk, 176 in core
 DDT-sha256-zap-unique: 16841361 entries, size 284 on disk, 145 in core
 
 dedup = 1.34, compress = 1.00, copies = 1.00, dedup * compress / copies
 = 1.34
 
 # echo ::sizeof ddt_entry_t | mdb -k
 sizeof (ddt_entry_t) = 0x178

As you can see in that other thread, I am exploring dedup performance too,
and finding that this method of calculation is totally ineffective. Number
of blocks times size of ddt_entry, as you have seen, produces a
reasonable-looking number, but the experimentally measured results are
nowhere near it.



Re: [zfs-discuss] Same device node appearing twice in same mirror; one faulted, one not...

2011-05-20 Thread Cindy Swearingen

Hi Alex

More scary than interesting to me.

What kind of hardware and which Solaris release?

Do you know what steps lead up to this problem? Any recent hardware
changes?

This output should tell you which disks were in this pool originally:

# zpool history tank

If the history identifies tank's actual disks, maybe you can determine
which disk is masquerading as c5t1d0.

If that doesn't work, accessing the individual disk entries in format
should tell you which one is the problem, if it's only one.
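
If format keeps hanging, a lower-impact way to read a disk's identity is
iostat's device-error/inquiry summary, which reports the vendor, product,
and serial number strings from cached kstat data:

# iostat -En c5t1d0

Comparing the reported serial number against the physical drive labels may
tell you which disk is actually answering as c5t1d0.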

I would like to see the output of this command:

# zdb -l /dev/dsk/c5t1d0s0

Make sure you have a good backup of your data. If you need to pull a
disk to check cabling, or rule out controller issues, you should
probably export this pool first. Have a good backup.

Others have resolved minor device issues by exporting/importing the
pool, but with format/zpool commands hanging on your system, I'm not
confident that this operation will work for you.

Thanks,

Cindy

On 05/19/11 12:17, Alex wrote:

I thought this was interesting - it looks like we have a failing drive in our 
mirror, but the two device nodes in the mirror are the same:

  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed after 1h9m with 0 errors on Sat May 14 03:09:45 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c5t1d0  ONLINE       0     0     0
            c5t1d0  FAULTED      0     0     0  corrupted data

c5t1d0 does indeed only appear once in the format list. I wonder how to go 
about correcting this if I can't uniquely identify the failing drive.

format takes forever to spill its guts, and the zpool commands all hang...
clearly there is a hardware error here, probably causing that, but I'm not
sure how to identify which disk to pull.



[zfs-discuss] New twist on the faulted zpools

2011-05-20 Thread Paul Kraus
I have run into a more serious and scary situation after our array
outage yesterday.

As I posted earlier today, I came in this morning and found 9 LUNs offline
(out of over 120). Not a big deal, as the rest of the array was
OK (and still is), and the other arrays are fine. Everything is
mirrored across arrays. I started replacing the bad LUNs (via 'zpool
replace') with some excess capacity we have. The first two went fine,
the third is still resilvering. The fourth, on the other hand, has been
a nightmare. Here is the current state:

   pool: deadbeef
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 2h4m, 0.07% done, 3186h1m to go
config:

        NAME                                 STATE     READ WRITE CKSUM
        deadbeef                             UNAVAIL      0     0     0  insufficient replicas
          mirror-0                           DEGRADED     0     0     0
            c5t600C0FF009278536638D9B07d0    ONLINE       0     0     0
            replacing-1                      DEGRADED     0     0     0
              c5t600C0FF00922614781B19005d0  UNAVAIL      0     0     0  corrupted data
              c5t600C0FF009277F7905F6DD05d0  ONLINE       0     0     0  38K resilvered
          mirror-1                           UNAVAIL      0     0     0  corrupted data
            c5t600C0FF00927852FB91AD301d0    ONLINE       0     0     0
            c5t600C0FF00922614781B19006d0    ONLINE       0     0     0  14K resilvered
          mirror-2                           ONLINE       0     0     0
            c5t600C0FF009277F6FA1A14C06d0    ONLINE       0     0     0  31K resilvered
            c5t600015D60200B361d0            ONLINE       0     0     0
          mirror-3                           DEGRADED     0     0     0
            replacing-0                      DEGRADED     0     0     0
              c5t600C0FF0092261491D9A9F09d0  UNAVAIL      0     0     0  cannot open
              c5t600015D60200B365d0          ONLINE       0     0     0  32.9M resilvered
            c5t600C0FF009277F7905F6DD02d0    ONLINE       0     0     0  2.50K resilvered

errors: 134 data errors, use '-v' for a list

Now, of all these UNAVAIL and FAULTED devices, only one is actually
bad: c5t600C0FF0092261491D9A9F09d0 is from the RAID set that
is dead. When the array was cold-booted yesterday, there was a
temporary outage of the LUNs from the other two RAID sets as well
(c5t600C0FF00922614781B19005d0 and
c5t600C0FF00922614781B19006d0). We have seen this before, and
usually we just do a 'zpool clear' of the device and a resilver gets
us back where we need to be.

This time has been different... I did a 'zpool clear deadbeef
c5t600C0FF00922614781B19005d0' and the zpool immediately went
UNAVAIL with c5t600C0FF009278536638D9B07d0 going UNAVAIL. I
did a 'zpool clear deadbeef c5t600C0FF009278536638D9B07d0' and
it came right back.

At that point I confirmed that I could read from both
c5t600C0FF009278536638D9B07d0 and
c5t600C0FF00922614781B19005d0 using dd. I also let the
resilver in progress complete, which it did in about an hour with no
issues.
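
For the record, the read test was nothing fancier than something along the
lines of the following against each LUN's raw device (the exact slice is
from memory, so treat it as approximate):

# dd if=/dev/rdsk/c5t600C0FF009278536638D9B07d0s0 of=/dev/null bs=1024k count=100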

I then did the zpool replace on
c5t600C0FF0092261491D9A9F09d0 in mirror-3 (the really dead
device) and was rewarded with an UNAVAIL pool again. I cleared a
number of known-good devices and got the pool back.

   At this point I assumed the ZFS label on
c5t600C0FF00922614781B19005d0 had somehow gotten corrupted, so
I tried a zpool replace of it with itself; even with -f it would
not let me. So I tried replacing it with a different LUN, as you can
see above. That was when it all went into the crapper and has stayed
there. zpool clear does not even return (and can't be killed), and
mirror-1 reports UNAVAIL even though both halves report ONLINE.

   I am afraid to EXPORT in case it won't IMPORT, but I have also
started the process to restore from the replicated copy of the data
from a remote site. After lunch I will probably try and EXPORT /
IMPORT and see if that gets me anywhere.

NOTE: there are 16 other pools on this server; one is resilvering, one
still has bad LUNs I need to replace, and the rest are fine. This pool
has a capacity of 1.5 TB and is about 1.37 TB used; the remaining pool
to clean up is 8 TB used out of 9 TB, and we really can't afford to
have these kinds of problems with that one.

-- 
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, RPI Players