Re: [zfs-discuss] ZFS, power failures, and UPSes (and ZFS recovery guide links)

2009-07-01 Thread Haudy Kazemi

Ian Collins wrote:

David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting
seconds, or a minute.  Otherwise the outage is usually longer than
the UPSs can stay up, since the problem requires human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Or run your systems off DC and fit as much battery backup as you have
room (and budget!) for.  I once visited a central exchange with 48
hours of battery capacity...


The way Google handles UPSes is to have a small 12V battery integrated 
with each PC power supply.  When the machine is on, the battery has its 
charge maintained.  It is not unlike a laptop in that it has a built-in 
battery backup, but it uses an inexpensive sealed lead-acid battery 
instead of lithium-ion.  Here is some info along with photos of the Google 
server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html

(IIRC there have been PC power supplies with a built-in UPS battery 
since at least the late 1980s.  Either that or they were UPSes that fit 
inside the standard PC (AT) compatible desktop case, making the power 
protection system entirely internal to the computer.  I think I saw 
these models while browsing late-1980s or early-1990s issues of 
PC Magazine that reviewed UPSes.  They still exist... one company selling 
them is http://www.globtek.com/html/ups.html .  A Google search for 
'power supply built in UPS' would likely find more.)


I also did additional searches in the zfs-discuss archives and found a 
thread from mid-February, which led me to other threads.  It looks like 
there are still scattered instances where ZFS has not recovered 
gracefully from power failures or other failures and where it became 
necessary to perform a manual transaction group (txg) rollback.  Here is 
a consolidated list of links related to manual uberblock transaction 
group (txg) rollback and similar ZFS data recovery guides, including 
undeleting:


Section 1: Nathan Hand's guide and related thread
Nathan Hand's guide to invalidating uberblocks (Dec 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=85794
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg22153.html


Section 2: Victor Latushkin's guide and related threads
Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l 
device problems (Oct 2008 to Feb 2009 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=76960
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19839.html

Repair report: Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
http://www.opensolaris.org/jive/message.jspa?messageID=289537#289537

Some recovery discussion by Victor: zdb -bv alone took several hours to 
walk the block tree

http://www.opensolaris.org/jive/message.jspa?messageID=292991#292991
or 
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022365.html

or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20095.html

Victor Latushkin's guide: Thanks to COW nature of ZFS it was possible 
to successfully recover pool state which was only 5 seconds older than 
last unopenable one.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022331.html
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20061.html


Section 3: reliability debates, recovery tool planning, uberblock info
Thread: Availability: ZFS needs to handle disk removal / driver failure 
better (August 2008 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=70811
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19057.html

Thread: ZFS: unreliable for professional usage? (Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=91426
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23833.html

Richard Elling's post noting that uberblocks are kept in a 128-entry 
circular queue which is 4x redundant, with 2 copies each at the beginning 
and end of the vdev.  Other metadata, by default, is 2x redundant and 
spatially diverse.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg24145.html

Jeff Bonwick's post about Bug ID 6667683
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23961.html

Bug ID 6667683: need a way to rollback to an uberblock from a previous txg
Description: If we are unable to open the pool based on the most recent 
uberblock then it might be useful to try an older txg uberblock as it 
might provide a better view of the world. Having a utility to reset the 
uberblock to a previous txg might provide a nice recovery mechanism.
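
To make the idea concrete, here is a purely conceptual C sketch - not actual
ZFS code, and the struct is trimmed to a few fields - of what such a rollback
utility has to do: walk the 128-entry uberblock ring that Richard Elling
describes above and pick the newest valid uberblock at or below the chosen
rollback txg.

#include <stddef.h>
#include <stdint.h>

#define UB_MAGIC        0x00bab10cULL   /* on-disk uberblock magic ("oo-ba-bloc") */
#define UB_RING_SLOTS   128

typedef struct ub {
        uint64_t ub_magic;      /* UB_MAGIC for a valid slot */
        uint64_t ub_txg;        /* txg this uberblock closed */
        uint64_t ub_timestamp;  /* sync time, seconds since epoch */
} ub_t;

/* Return the newest valid uberblock whose txg is <= max_txg, or NULL. */
static const ub_t *
pick_rollback_uberblock(const ub_t ring[UB_RING_SLOTS], uint64_t max_txg)
{
        const ub_t *best = NULL;
        int i;

        for (i = 0; i < UB_RING_SLOTS; i++) {
                if (ring[i].ub_magic != UB_MAGIC)
                        continue;               /* empty or damaged slot */
                if (ring[i].ub_txg > max_txg)
                        continue;               /* newer than the rollback point */
                if (best == NULL || ring[i].ub_txg > best->ub_txg)
                        best = &ring[i];
        }
        return (best);
}

Invalidating the newer uberblocks so that the pool opens from an older one is
essentially what the guides in Sections 1 and 2 above walk through by hand.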


Re: [zfs-discuss] Q: zfs log device

2009-07-01 Thread Mark J Musante

On Tue, 30 Jun 2009, John Hoogerdijk wrote:

I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log 
device.  To see how well it works, I ran bonnie++, but never saw any 
I/Os on the log device (using iostat -nxce).  Pool status is good - no 
issues or errors.  Any ideas?


Try using direct i/o (the -D flag) in bonnie++.  You'll need at least 
version 1.03e.



Regards,
markm


Re: [zfs-discuss] ZFS tale of woe and fail

2009-07-01 Thread Victor Latushkin

On 19.01.09 12:09, Tom Bird wrote:

Toby Thain wrote:

On 18-Jan-09, at 6:12 PM, Nathan Kroenert wrote:


Hey, Tom -

Correct me if I'm wrong here, but it seems you are not allowing ZFS any
sort of redundancy to manage.


Every other file system out there runs fine on a single LUN; when things
go wrong you have an fsck utility that patches it up and the world keeps
on turning.

I can't find anywhere that will sell me a 48-drive SATA JBOD with all
the drives presented on a single SAS channel, so running on a single
giant LUN is a real-world scenario that ZFS should be able to cope with,
as this is how the hardware I am stuck with is arranged.


Which is particularly catastrophic when one's 'content' is organized as
a monolithic file, as it is here - unless, of course, you have some way
of scavenging that file based on internal structure.


No, it's not a monolithic file; the point I was making there is that no
files are showing up.


r...@cs4:~# find /content
/content
r...@cs4:~# (yes that really is it)


This issue (and the previous one reported by Tom) has received some 
publicity recently - see here:


http://www.uknof.org.uk/uknof13/Bird-Redux.pdf

So I feel I need to provide a little bit more information about the 
outcome (sorry that it is delayed and not as complete as the previous one).


First, it looked like this:


r...@cs4:~# zpool list
NAME      SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
content  62.5T  59.9T  2.63T    95%  ONLINE  -

r...@cs4:~# zpool status -v
  pool: content
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        content     ONLINE       0     0    32
          c2t8d0    ONLINE       0     0    32

errors: Permanent errors have been detected in the following files:

content:0x0
content:0x2c898



The first permanent error means that the root block of the filesystem named 
'content' was corrupted (all copies), so it was not possible to open that 
filesystem and access any of its contents.


Fortunately, there was not too much activity on the pool, so we decided 
to try previous states of the pool.  I do not remember the exact txg 
number we tried, but it was something like a hundred txgs back or so.  We 
checked that state with zdb and discovered it was more or less good - 
at least the filesystem 'content' was openable and it was possible to 
access its contents - so we decided to reactivate that previous state.  
The pool imported fine and the contents of 'content' were there.  A 
subsequent scrub did find some errors, but I do not remember exactly how 
many.  Tom may have the exact number.


Victor


Re: [zfs-discuss] Q: zfs log device

2009-07-01 Thread Bob Friesenhahn

On Wed, 1 Jul 2009, Mark J Musante wrote:


On Tue, 30 Jun 2009, John Hoogerdijk wrote:

I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log 
device.  To see how well it works, I ran bonnie++, but never saw any 
I/Os on the log device (using iostat -nxce).  Pool status is good - no 
issues or errors.  Any ideas?


Try using direct i/o (the -D flag) in bonnie++.  You'll need at least version 
1.03e.


If this -D flag uses the Solaris directio() function, then it will do 
nothing for ZFS.  It only works for UFS and NFS.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] Q: zfs log device

2009-07-01 Thread Jason Ozolins

Mark J Musante wrote:

On Tue, 30 Jun 2009, John Hoogerdijk wrote:

I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log 
device.  To see how well it works, I ran bonnie++, but never saw any 
I/Os on the log device (using iostat -nxce).  Pool status is good - no 
issues or errors.  Any ideas?


Try using direct i/o (the -D flag) in bonnie++.  You'll need at least 
version 1.03e.


Or you could export the filesystem via NFS and run any file creation/write 
workload on an NFS client; that should generate a large amount of log 
activity thanks to the synchronous writes that the NFS server must issue 
to honour its obligations to the NFS client.


--
jason.ozol...@anu.edu.au ANU Supercomputer Facility
 Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia


Re: [zfs-discuss] ZFS tale of woe and fail

2009-07-01 Thread David Magda

On Jul 1, 2009, at 12:37, Victor Latushkin wrote:

This issue (and the previous one reported by Tom) has received some 
publicity recently - see here:


http://www.uknof.org.uk/uknof13/Bird-Redux.pdf


Joyent also had issues a while back:

http://tinyurl.com/ytyzs6
http://www.joyeur.com/2008/01/22/bingodisk-and-strongspace-what-happened

A lot of people billed it as a ZFS issue, but it should be noted that, 
because of all the checksumming going on, when you do get data back you 
can be fairly sure that it hasn't been corrupted.




Re: [zfs-discuss] Q: zfs log device

2009-07-01 Thread Jason Ozolins

John Hoogerdijk wrote:
So I guess there is some porting to do - no O_DIRECT in Solaris...

Anyone have bonnie++ 1.03e ported already?


For your purposes, couldn't you replace O_DIRECT with O_SYNC as a hack? 
If you're trying to benchmark the log device, the important thing is to 
generate synchronous writes, and the zero-copy aspect of O_DIRECT is less 
important, no?
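
As a rough sketch of that hack (not taken from bonnie++; the file path and
sizes here are made up for illustration), opening the test file with O_SYNC
(or O_DSYNC) makes every write() synchronous, which is exactly the kind of
traffic that should land on the separate log device:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        char buf[8192];
        int fd, i;

        memset(buf, 'x', sizeof (buf));

        /* O_SYNC forces each write to reach stable storage before
         * returning, so ZFS commits it through the ZIL (and hence the
         * slog, if one is configured). */
        fd = open("/tank/syncfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
                return (1);

        for (i = 0; i < 10000; i++) {
                if (write(fd, buf, sizeof (buf)) != sizeof (buf))
                        break;
        }

        (void) close(fd);
        return (0);
}

Watching iostat -nxce while something like this runs should show activity
on the log device.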

--
jason.ozol...@anu.edu.au ANU Supercomputer Facility
 Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia


Re: [zfs-discuss] ZFS write I/O stalls

2009-07-01 Thread Marcelo Leal
 
 Note that this issue does not apply at all to NFS
 service, database 
 service, or any other usage which does synchronous
 writes.
 
 Bob
Hello Bob,

There is an impact for all workloads.  Whether a write is synchronous or not
only determines whether it also goes to the slog (SSD); the txg interval and
sync time are the same either way.  Actually, the ZIL code is there precisely
to preserve that same behaviour for synchronous writes.

 Leal
[ http://www.eall.com.br/blog ]


Re: [zfs-discuss] Backing up OS drive?

2009-07-01 Thread Tertius Lydgate
Hi cindys,

That recovery procedure seems overly complex.  I've instead purchased a disk to 
mirror my root pool onto.  Unfortunately, it seems that the disk is slightly 
smaller than my current rpool.  However, I would be happy to have a mirror the 
same size as the smaller disk.  Is there a way to mirror onto a smaller disk, 
or alternatively to send the root pool to the smaller disk, boot from it, and 
then mirror to the larger one?

Thanks,

Lydgate


Re: [zfs-discuss] rpool mirror on USB sticks

2009-07-01 Thread Tertius Lydgate
Did you ever figure this out?  I'm trying to do the same thing and am also 
getting "new device must be a single disk".


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-07-01 Thread Andre van Eyssen

On Thu, 2 Jul 2009, Ian Collins wrote:


5+ is typical for telco use.


Aah, but we start getting into rooms full of giant 2V wet lead acid cells 
and giant busbars the size of railway tracks.


--
Andre van Eyssen.
mail: an...@purplecow.org  jabber: an...@interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org



Re: [zfs-discuss] ZFS write I/O stalls

2009-07-01 Thread Zhu, Lejun
Actually it seems to be 3/4:

dsl_pool.c
    391         zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
    392         zfs_write_limit_inflated = MAX(zfs_write_limit_min,
    393             spa_get_asize(dp->dp_spa, zfs_write_limit_max));

While spa_get_asize is:

spa_misc.c
   1249 uint64_t
   1250 spa_get_asize(spa_t *spa, uint64_t lsize)
   1251 {
   1252         /*
   1253          * For now, the worst case is 512-byte RAID-Z blocks, in which
   1254          * case the space requirement is exactly 2x; so just assume that.
   1255          * Add to this the fact that we can have up to 3 DVAs per bp, and
   1256          * we have to multiply by a total of 6x.
   1257          */
   1258         return (lsize * 6);
   1259 }

Which will result in:

    zfs_write_limit_inflated = MAX((32 << 20), (ptob(physmem) >> 3) * 6);
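
As a back-of-the-envelope check (illustration only, plugging in the 20GB of
RAM from Bob's mail quoted below and the default zfs_write_limit_shift of 3):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* Illustration only - stand-ins for the kernel variables above. */
        uint64_t physbytes = 20ULL << 30;       /* ptob(physmem) on a 20GB box  */
        uint64_t limit_max = physbytes >> 3;    /* 1/8 of memory = 2.5GB        */
        uint64_t inflated  = limit_max * 6;     /* 6x from spa_get_asize = 15GB */

        (void) printf("inflated write limit = %llu bytes\n",
            (unsigned long long)inflated);
        return (0);
}

In other words, the 1/8 clamp is applied before the 6x inflation, so the
effective ceiling comes out to 6/8 = 3/4 of physical memory.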

Bob Friesenhahn wrote:
 Even if I set zfs_write_limit_override to 8053063680 I am unable to
 achieve the massive writes that Solaris 10 (141415-03) sends to my
 drive array by default.
 
 When I read the blog entry at
 http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle, I see this
 statement:
 
 The new code keeps track of the amount of data accepted in a TXG and
 the time it takes to sync. It dynamically adjusts that amount so that
 each TXG sync takes about 5 seconds (txg_time variable). It also
 clamps the limit to no more than 1/8th of physical memory.
 
 On my system I see that the "about 5 seconds" rule is being followed,
 but see no sign of clamping the limit to no more than 1/8th of
 physical memory.  There is no sign of clamping at all.  The written
 data is captured and does take about 5 seconds to write (a good
 estimate).
 
 On my system with 20GB of RAM, and ARC memory limit set to 10GB
 (zfs:zfs_arc_max = 0x280000000), the maximum zfs_write_limit_override
 value I can set is on the order of 8053063680, yet this results in a
 much smaller amount of data being written per write cycle than the
 Solaris 10 default operation.  The default operation is 24 seconds of
 no write activity followed by 5 seconds of write.
 
 On my system, 1/8 of memory would be 2.5GB.  If I set the
 zfs_write_limit_override value to 2684354560 then it seems that about
 1.2 seconds of data is captured for write.  In this case I see 5
 seconds of no write followed by maybe a second of write.
 
 This causes me to believe that the algorithm is not implemented as
 described in Solaris 10.
 
 Bob


Re: [zfs-discuss] rpool mirror on USB sticks

2009-07-01 Thread Ian Collins

Tertius Lydgate wrote:
Did you ever figure this out?  

Figure what out?

--
Ian.
