Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread BM
On Fri, May 7, 2010 at 4:57 AM, Brandon High  wrote:
> I believe that the L2ARC behaves the same as a pool with multiple
> top-level vdevs. It's not typical striping, where every write goes to
> all devices. Writes may go to only one device, or may avoid a device
> entirely while using several other. The decision about where to place
> data is done at write time, so no fixed width stripes are created at
> allocation time.

There is not much there to believe or disbelieve.

Write accesses to the L2ARC devices are grouped and sent in sequence. A queue
is used to sort them into larger or fewer chunks to write. The L2ARC behaves
in a rotor fashion, simply sweeping writes through the available space. That's
all the magic, nothing very special...

To answer Mike's main question, the behavior on failure is quite simple: once
one or more L2ARC devices are gone, the others will continue to function. The
impact is a small performance loss and some time needed to warm the cache back
up and sort things out. There are no serious consequences and no data loss
here.
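
For anyone who wants to watch this in practice, a quick sketch (the pool name
"tank" and the device names are only examples) is to check the cache section
of zpool status and the per-device traffic with zpool iostat:

  # zpool status tank        # a failed cache device should show a non-ONLINE state here
  # zpool iostat -v tank 5   # per-vdev activity, including the cache devices, every 5 seconds

The pool itself stays usable while the remaining cache devices keep serving
reads.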

Take care, folks.

-- 
Kind regards, BM

Things that are stupid at the beginning rarely end up wisely.


Re: [zfs-discuss] Does ZFS use large memory pages?

2010-05-06 Thread Rob
Hi Gary,
I would not remove this line in /etc/system.
We have been combatting this bug for a while now on our ZFS file system running 
JES Commsuite 7. 

I would be interested in finding out how you were able to pinpoint the
problem.

We have no worries with the system currently, but when the file system gets
above 80% full we see quite a number of issues, much the same as what you've
had in the past: ps and prstat hanging.

Are you able to tell me the IDR number that you applied?

Thanks,
Rob


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Robert Milkowski

On 06/05/2010 21:45, Nicolas Williams wrote:
> On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote:
>> On 5/6/10 5:28 AM, Robert Milkowski wrote:
>>> sync=disabled
>>> Synchronous requests are disabled. File system transactions
>>> only commit to stable storage on the next DMU transaction group
>>> commit which can be many seconds.
>>
>> Is there a way (short of DTrace) to write() some data and get
>> notified when the corresponding txg is committed? Think of it as a
>> poor man's group commit.
>
> fsync(2) is it.  Of course, if you disable sync writes then there's no
> way to find out for sure.  If you need to know when a write is durable,
> then don't disable sync writes.
>
> Nico

There is one way - issue a sync(2) - even with sync=disabled it will 
sync all filesystems and then return.

Another workaround would be to create a snapshot...
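
As a rough sketch of both workarounds (the dataset name tank/fs is only an
example):

  cp important.dat /tank/fs/
  sync                              # per the above, on ZFS this syncs all filesystems before returning
  # or force a txg commit by creating (and then deleting) a throwaway snapshot
  zfs snapshot tank/fs@txg-barrier
  zfs destroy tank/fs@txg-barrier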

However I agree with Nico - if you don't need sync=disabled then don't 
use it.


Someone else mentioned that yet another option like sync=fsync-only would be
useful, so that everything would be async except fsync() - but frankly I'm not
convinced, as it would require support in your application, and at that point
you already have full control of the behavior without needing sync=disabled.


--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Marc Moreau
On Tue, May 4, 2010 at 11:34 AM, Brandon High  wrote:

> On Tue, May 4, 2010 at 10:19 AM, Tony MacDoodle 
> wrote:
> > How would one determine if I should have a separate ZIL disk? We are
> using
> > ZFS as the backend of our Guest Domains boot drives using LDom's. And we
> are
> > seeing bad/very slow write performance?
>
> There's a dtrace script that Richard Elling wrote called zilstat.ksh.
> It's available at
> http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
>
> I'm not sure what the numbers mean (there's info at the address) but
> anything other than lots of 0s indicates that the ZIL is being used.


On my workstation, I peg my IOPS when using VirtualBox set to run on zvols.
The zilstat line comes back with about 3000 total synchronous writes per
30 seconds, which means that my disks are doing about 90 sync write IOPS. That
is about the upper limit for 7200 RPM disks (from what I understand).

This 3000 number doesn't really change much over time when running under I/O
load.

With the ZIL disabled, I get much better performance in terms of I/O
throughput. This tells me that the ZIL is the bottleneck. I will be getting an
SSD soon.
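
For anyone following along, attaching the SSD as a dedicated log device is a
one-liner; the pool and device names below are made up, and mirroring the slog
is the usual recommendation on pool versions that cannot yet survive losing an
unmirrored log device:

  # zpool add tank log c7t0d0                  # single slog
  # zpool add tank log mirror c7t0d0 c7t1d0    # mirrored slog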

-- Marc


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Richard Elling
On May 6, 2010, at 11:08 AM, Michael Sullivan wrote:
> Well, if you are striping over multiple devices the you I/O should be spread 
> over the devices and you should be reading them all simultaneously rather 
> than just accessing a single device.  Traditional striping would give 1/n 
> performance improvement rather than 1/1 where n is the number of disks the 
> stripe is spread across.

In theory, for bandwidth, yes, striping does improve by N.  For latency, 
striping
adds little, and in some cases is worse.  ZFS dynamic stripe tries to balance 
this 
tradeoff towards latency for HDDs by grouping blocks so that only one 
seek+rotate
is required. More below...

> The round-robin access I am referring to, is the way the L2ARC vdevs appear 
> to be accessed.  

RAID-0 striping is also round-robin.

> So, any given object will be taken from a single device rather than from 
> several devices simultaneously, thereby increasing the I/O throughput.  So, 
> theoretically, a stripe spread over 4 disks would give 4 times the 
> performance as opposed to reading from a single disk.  This also assumes the 
> controller can handle multiple I/O as well or that you are striped over 
> different disk controllers for each disk in the stripe.

All modern controllers handle multiple, concurrent I/Os.

> SSD's are fast, but if I can read a block from more devices simultaneously, 
> it will cut the latency of the overall read.

OTOH, if you have to wait for N HDDs to seek+rotate, then the latency is that 
of the
slowest disk.  The classic analogy is: nine women cannot produce a baby in one 
month.
The difference is:

ZFS dynamic stripe:
    latency per I/O = fixed latency of one vdev + (size / min(media bandwidth, path bandwidth))

RAID-0:
    latency per I/O = max(fixed latency of devices) + (size / min((media bandwidth / N), path bandwidth))

For HDDs, the media bandwidth is around 100 MB/sec for many devices, far less
than the path bandwidth on a modern system. For many SSDs, the media bandwidth
is close to the path bandwidth. Newer SSDs have media bandwidth > 3Gbps, but
6Gbps SAS is becoming readily available. In other words, if the path bandwidth
isn't a problem, and the media bandwidth of an SSD is 3x that of an HDD, then
the bandwidth requirement that dictated RAID-0 for HDDs is reduced by a factor
of 3. Yet another reason why HDDs lost the performance battle.
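
For a rough feel for the media bandwidth of one of your own devices, a crude
sketch (the raw device path is hypothetical, and this only measures streaming
reads):

  # ptime dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=1024k count=1024

Roughly 1 GB in 10 seconds is about 100 MB/sec, i.e. the HDD ballpark above.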

This is also why not many folks choose to use HDDs for L2ARC -- the latency
gain over the pool is marginal for HDDs.

This is also one reason why there is no concatenation in ZFS.
 -- richard

-- 
ZFS storage and performance consulting at http://www.RichardElling.com












Re: [zfs-discuss] dedup ratio for iscsi-shared zfs dataset

2010-05-06 Thread Cindy Swearingen

Hi--

Even though the dedup property can be set on a per-file-system basis,
dedup space usage is accounted for at the pool level, using the
zpool list command.

My non-expert opinion is that it would be near impossible to report
space usage for dedup and non-dedup file systems at the file system
level.

More details are in the ZFS Dedup FAQ:

http://hub.opensolaris.org/bin/view/Community+Group+zfs/dedup

Thanks,

Cindy

On 05/06/10 12:31, eXeC001er wrote:

Hi.

How can i get this info?

Thanks.






Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Nicolas Williams
On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote:
> On 5/6/10 5:28 AM, Robert Milkowski wrote:
> 
> >sync=disabled
> >Synchronous requests are disabled. File system transactions
> >only commit to stable storage on the next DMU transaction group
> >commit which can be many seconds.
> 
> Is there a way (short of DTrace) to write() some data and get
> notified when the corresponding txg is committed? Think of it as a
> poor man's group commit.

fsync(2) is it.  Of course, if you disable sync writes then there's no
way to find out for sure.  If you need to know when a write is durable,
then don't disable sync writes.

Nico
-- 


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Wes Felter

On 5/6/10 5:28 AM, Robert Milkowski wrote:
> sync=disabled
> Synchronous requests are disabled. File system transactions
> only commit to stable storage on the next DMU transaction group
> commit which can be many seconds.


Is there a way (short of DTrace) to write() some data and get notified 
when the corresponding txg is committed? Think of it as a poor man's 
group commit.


Wes Felter



Re: [zfs-discuss] why both dedup and compression?

2010-05-06 Thread Erik Trimble
On Fri, 2010-05-07 at 03:10 +0900, Michael Sullivan wrote:
> This is interesting, but what about iSCSI volumes for virtual machines?
> 
> Compress or de-dupe?  Assuming the virtual machine was made from a clone of 
> the original iSCSI or a master iSCSI volume.
> 
> Does anyone have any real world data this?  I would think the iSCSI volumes 
> would diverge quite a bit over time even with compression and/or 
> de-duplication.
> 
> Just curious…
> 

VM OS storage is an ideal candidate for dedup, and NOT compression (for
the most part).

VM images contain large quantities of executable files, most of which
compress poorly, if at all. However, having 20 copies of the same
Windows 2003 VM image makes for very nice dedup.
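
A minimal sketch of that policy for a dataset holding VM images (the pool and
dataset names are just examples):

  # zfs create -o dedup=on -o compression=off tank/vmimages
  # zfs get dedup,compression tank/vmimages
  # zpool get dedupratio tank      # dedup savings are reported per pool

For iSCSI-backed guests, as in Michael's question, the same properties can be
set on a zvol, e.g. zfs create -V 20g -o dedup=on tank/vm01.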



-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Robert Milkowski

On 06/05/2010 19:08, Michael Sullivan wrote:
> Hi Marc,
>
> Well, if you are striping over multiple devices the you I/O should be
> spread over the devices and you should be reading them all
> simultaneously rather than just accessing a single device.
> Traditional striping would give 1/n performance improvement rather
> than 1/1 where n is the number of disks the stripe is spread across.
>
> The round-robin access I am referring to, is the way the L2ARC vdevs
> appear to be accessed.  So, any given object will be taken from a
> single device rather than from several devices simultaneously, thereby
> increasing the I/O throughput.  So, theoretically, a stripe spread
> over 4 disks would give 4 times the performance as opposed to reading
> from a single disk.  This also assumes the controller can handle
> multiple I/O as well or that you are striped over different disk
> controllers for each disk in the stripe.
>
> SSD's are fast, but if I can read a block from more devices
> simultaneously, it will cut the latency of the overall read.




Keep in mind that the largest block is currently 128KB and you always
need to read an entire block.
Splitting a block across several L2ARC devices would probably decrease
performance, and it would invalidate all blocks if even a single L2ARC
device died. Additionally, having each block on only one L2ARC device
allows reads from all of the L2ARC devices at the same time.


--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] dedup ratio for iscsi-shared zfs dataset

2010-05-06 Thread Brandon High
On Thu, May 6, 2010 at 11:31 AM, eXeC001er  wrote:
> How can i get this info?

$ man zpool

$ zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool   111G  15.5G  95.5G    13%  1.00x  ONLINE  -
tank   7.25T  3.16T  4.09T    43%  1.12x  ONLINE  -
$ zpool get dedupratio tank
NAME  PROPERTY    VALUE  SOURCE
tank  dedupratio  1.12x  -

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Brandon High
On Thu, May 6, 2010 at 11:08 AM, Michael Sullivan
 wrote:
> The round-robin access I am referring to, is the way the L2ARC vdevs appear
> to be accessed.  So, any given object will be taken from a single device
> rather than from several devices simultaneously, thereby increasing the I/O
> throughput.  So, theoretically, a stripe spread over 4 disks would give 4

I believe that the L2ARC behaves the same as a pool with multiple
top-level vdevs. It's not typical striping, where every write goes to
all devices. Writes may go to only one device, or may avoid a device
entirely while using several others. The decision about where to place
data is done at write time, so no fixed width stripes are created at
allocation time.

In your example, if the file had at least four blocks there is a
likelihood that it will be spread across the four top-level vdevs.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Best practice for full stystem backup - equivelent of ufsdump/ufsrestore

2010-05-06 Thread Cindy Swearingen

Hi Bob,

You can review the latest Solaris 10 and OpenSolaris release dates here:

http://www.oracle.com/ocom/groups/public/@ocom/documents/webcontent/059542.pdf

Solaris 10 release, CY2010
OpenSolaris release, 1st half CY2010

Thanks,

Cindy

On 05/05/10 18:03, Bob Friesenhahn wrote:

On Wed, 5 May 2010, Ray Van Dolson wrote:


From a zfs standpoint, Solaris 10 does not seem to be behind the
currently supported OpenSolaris release.


Well, being able to remove ZIL devices is one important feature
missing.  Hopefully in U9. :)


While the development versions of OpenSolaris are clearly well beyond 
Solaris 10, I don't believe that the supported version of OpenSolaris (a 
year old already) has this feature yet either and Solaris 10 has been 
released several times since then already.  When the forthcoming 
OpenSolaris release emerges in 2011, the situation will be far 
different.  Solaris 10 can then play catch-up with the release of U9 in 
2012.


Bob



Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Bob Friesenhahn

On Fri, 7 May 2010, Michael Sullivan wrote:


> Well, if you are striping over multiple devices the you I/O should be
> spread over the devices and you should be reading them all simultaneously
> rather than just accessing a single device.  Traditional striping would
> give 1/n performance improvement rather than 1/1 where n is the number of
> disks the stripe is spread across.


This is true.  Use of mirroring also improves performance since a 
mirror multiplies the read performance for the same data.  The value 
of the various approaches likely depends on the total size of the 
working set and the number of simultaneous requests.


Currently available L2ARC SSD devices are very good with a high number
of I/Os, but they are quite a bottleneck for bulk reads as compared
to the L1ARC in RAM.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


[zfs-discuss] dedup ratio for iscsi-shared zfs dataset

2010-05-06 Thread eXeC001er
Hi.

How can I get this info?

Thanks.


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Giovanni Tirloni
On Thu, May 6, 2010 at 1:18 AM, Edward Ned Harvey wrote:

> > From the information I've been reading about the loss of a ZIL device,
> What the heck?  Didn't I just answer that question?
> I know I said this is answered in ZFS Best Practices Guide.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices
>
> Prior to pool version 19, if you have an unmirrored log device that fails,
> your whole pool is permanently lost.
> Prior to pool version 19, mirroring the log device is highly recommended.
> In pool version 19 or greater, if an unmirrored log device fails during
> operation, the system reverts to the default behavior, using blocks from
> the
> main storage pool for the ZIL, just as if the log device had been
> gracefully
> removed via the "zpool remove" command.
>


This week I've had a bad experience replacing a SSD device that was in a
hardware RAID-1 volume. While rebuilding, the source SSD failed and the
volume was brought off-line by the controller.

The server kept working just fine but seemed to have switched from the
30-second interval to all writes going directly to the disks. I could
confirm this with iostat.

We've had some compatibility issues between LSI MegaRAID cards and a few
MTRON SSDs and I didn't believe the SSD had really died. So I brought it
off-line and back on-line and everything started to work.

ZFS showed the log device c3t1d0 as removed. After the RAID-1 volume was
back I replaced that device with itself and a resilver process started. I
don't know what it was resilvering against but it took 2h10min. I should
have probably tried a zpool offline/online too.

So I think if a log device fails AND you have to import your pool later
(server rebooted, etc.)... then you lose your pool (prior to version 19).
Right?

This happened on OpenSolaris 2009.06.

-- 
Giovanni


Re: [zfs-discuss] why both dedup and compression?

2010-05-06 Thread Michael Sullivan
This is interesting, but what about iSCSI volumes for virtual machines?

Compress or de-dupe?  Assuming the virtual machine was made from a clone of the 
original iSCSI or a master iSCSI volume.

Does anyone have any real-world data on this?  I would think the iSCSI volumes
would diverge quite a bit over time even with compression and/or de-duplication.

Just curious…

On 6 May 2010, at 16:39 , Peter Tribble wrote:

> On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel  wrote:
>> I've googled this for a bit, but can't seem to find the answer.
>> 
>> What does compression bring to the party that dedupe doesn't cover already?
> 
> Compression will reduce the storage requirements for non-duplicate data.
> 
> As an example, I have a system that I rsync the web application data
> from a whole
> bunch of servers (zones) to. There's a fair amount of duplication in
> the application
> files (java, tomcat, apache, and the like) so dedup is a big win. On
> the other hand,
> there's essentially no duplication whatsoever in the log files, which
> are pretty big,
> but compress really well. So having both enabled works really well.
> 
> -- 
> -Peter Tribble
> http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Michael Sullivan
Hi Marc,

Well, if you are striping over multiple devices then your I/O should be spread
over the devices and you should be reading them all simultaneously rather than
just accessing a single device.  Traditional striping would give 1/n
performance improvement rather than 1/1 where n is the number of disks the
stripe is spread across.

The round-robin access I am referring to, is the way the L2ARC vdevs appear to 
be accessed.  So, any given object will be taken from a single device rather 
than from several devices simultaneously, thereby increasing the I/O 
throughput.  So, theoretically, a stripe spread over 4 disks would give 4 times 
the performance as opposed to reading from a single disk.  This also assumes 
the controller can handle multiple I/O as well or that you are striped over 
different disk controllers for each disk in the stripe.

SSD's are fast, but if I can read a block from more devices simultaneously, it 
will cut the latency of the overall read.

On 7 May 2010, at 02:57 , Marc Nicholas wrote:

> Hi Michael,
> 
> What makes you think striping the SSDs would be faster than round-robin?
> 
> -marc
> 
> On Thu, May 6, 2010 at 1:09 PM, Michael Sullivan  
> wrote:
> Everyone,
> 
> Thanks for the help.  I really appreciate it.
> 
> Well, I actually walked through the source code with an associate today and 
> we found out how things work by looking at the code.
> 
> It appears that L2ARC is just assigned in round-robin fashion.  If a device 
> goes offline, then it goes to the next and marks that one as offline.  The 
> failure to retrieve the requested object is treated like a cache miss and 
> everything goes along its merry way, as far as we can tell.
> 
> I would have hoped it to be different in some way.  Like if the L2ARC was 
> striped for performance reasons, that would be really cool and using that 
> device as an extension of the VM model it is modeled after.  Which would mean 
> using the L2ARC as an extension of the virtual address space and striping it 
> to make it more efficient.  Way cool.  If it took out the bad device and 
> reconfigured the stripe device, that would be even way cooler.  Replacing it 
> with a hot spare more cool too.  However, it appears from the source code 
> that the L2ARC is just a (sort of) jumbled collection of ZFS objects.  Yes, 
> it gives you better performance if you have it, but it doesn't really use it 
> in a way you might expect something as cool as ZFS does.
> 
> I understand why it is read only, and it invalidates it's cache when a write 
> occurs, to be expected for any object written.
> 
> If an object is not there because of a failure or because it has been removed 
> from the cache, it is treated as a cache miss, all well and good - go fetch 
> from the pool.
> 
> I also understand why the ZIL is important and that it should be mirrored if 
> it is to be on a separate device.  Though I'm wondering how it is handled 
> internally when there is a failure of one of it's default devices, but then 
> again, it's on a regular pool and should be redundant enough, only just some 
> degradation in speed.
> 
> Breaking these devices out from their default locations is great for 
> performance, and I understand.  I just wish the knowledge of how they work 
> and their internal mechanisms were not so much of a black box.  Maybe that is 
> due to the speed at which ZFS is progressing and the features it adds with 
> each subsequent release.
> 
> Overall, I am very impressed with ZFS, its flexibility and even more so, it's 
> breaking all the rules about how storage should be managed and I really like 
> it.  I have yet to see anything to come close in its approach to disk data 
> management.  Let's just hope it keeps moving forward, it is truly a unique 
> way to view disk storage.
> 
> Anyway, sorry for the ramble, but to everyone, thanks again for the answers.
> 
> Mike
> 
> ---
> Michael Sullivan
> michael.p.sulli...@me.com
> http://www.kamiogi.net/
> Japan Mobile: +81-80-3202-2599
> US Phone: +1-561-283-2034
> 
> On 7 May 2010, at 00:00 , Robert Milkowski wrote:
> 
> > On 06/05/2010 15:31, Tomas Ögren wrote:
> >> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
> >>
> >>
> >>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
> >>>
>  In the L2ARC (cache) there is no ability to mirror, because cache device
>  removal has always been supported.  You can't mirror a cache device, 
>  because
>  you don't need it.
> 
> >>> How do you know that I don't need it?  The ability seems useful to me.
> >>>
> >> The gain is quite minimal.. If the first device fails (which doesn't
> >> happen too often I hope), then it will be read from the normal pool once
> >> and then stored in ARC/L2ARC again. It just behaves like a cache miss
> >> for that specific block... If this happens often enough to become a
> >> performance problem, then you should throw away that L2ARC device
> >> because it's broken beyond usability.
> >>
> >>
> >
> > 

Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Marc Nicholas
Hi Michael,

What makes you think striping the SSDs would be faster than round-robin?

-marc

On Thu, May 6, 2010 at 1:09 PM, Michael Sullivan  wrote:

> Everyone,
>
> Thanks for the help.  I really appreciate it.
>
> Well, I actually walked through the source code with an associate today and
> we found out how things work by looking at the code.
>
> It appears that L2ARC is just assigned in round-robin fashion.  If a device
> goes offline, then it goes to the next and marks that one as offline.  The
> failure to retrieve the requested object is treated like a cache miss and
> everything goes along its merry way, as far as we can tell.
>
> I would have hoped it to be different in some way.  Like if the L2ARC was
> striped for performance reasons, that would be really cool and using that
> device as an extension of the VM model it is modeled after.  Which would
> mean using the L2ARC as an extension of the virtual address space and
> striping it to make it more efficient.  Way cool.  If it took out the bad
> device and reconfigured the stripe device, that would be even way cooler.
>  Replacing it with a hot spare more cool too.  However, it appears from the
> source code that the L2ARC is just a (sort of) jumbled collection of ZFS
> objects.  Yes, it gives you better performance if you have it, but it
> doesn't really use it in a way you might expect something as cool as ZFS
> does.
>
> I understand why it is read only, and it invalidates it's cache when a
> write occurs, to be expected for any object written.
>
> If an object is not there because of a failure or because it has been
> removed from the cache, it is treated as a cache miss, all well and good -
> go fetch from the pool.
>
> I also understand why the ZIL is important and that it should be mirrored
> if it is to be on a separate device.  Though I'm wondering how it is handled
> internally when there is a failure of one of it's default devices, but then
> again, it's on a regular pool and should be redundant enough, only just some
> degradation in speed.
>
> Breaking these devices out from their default locations is great for
> performance, and I understand.  I just wish the knowledge of how they work
> and their internal mechanisms were not so much of a black box.  Maybe that
> is due to the speed at which ZFS is progressing and the features it adds
> with each subsequent release.
>
> Overall, I am very impressed with ZFS, its flexibility and even more so,
> it's breaking all the rules about how storage should be managed and I really
> like it.  I have yet to see anything to come close in its approach to disk
> data management.  Let's just hope it keeps moving forward, it is truly a
> unique way to view disk storage.
>
> Anyway, sorry for the ramble, but to everyone, thanks again for the
> answers.
>
> Mike
>
> ---
> Michael Sullivan
> michael.p.sulli...@me.com
> http://www.kamiogi.net/
> Japan Mobile: +81-80-3202-2599
> US Phone: +1-561-283-2034
>
> On 7 May 2010, at 00:00 , Robert Milkowski wrote:
>
> > On 06/05/2010 15:31, Tomas Ögren wrote:
> >> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
> >>
> >>
> >>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
> >>>
>  In the L2ARC (cache) there is no ability to mirror, because cache
> device
>  removal has always been supported.  You can't mirror a cache device,
> because
>  you don't need it.
> 
> >>> How do you know that I don't need it?  The ability seems useful to me.
> >>>
> >> The gain is quite minimal.. If the first device fails (which doesn't
> >> happen too often I hope), then it will be read from the normal pool once
> >> and then stored in ARC/L2ARC again. It just behaves like a cache miss
> >> for that specific block... If this happens often enough to become a
> >> performance problem, then you should throw away that L2ARC device
> >> because it's broken beyond usability.
> >>
> >>
> >
> > Well if a L2ARC device fails there might be an unacceptable drop in
> delivered performance.
> > If it were mirrored than a drop usually would be much smaller or there
> could be no drop if a mirror had an option to read only from one side.
> >
> > Being able to mirror L2ARC might especially be useful once a persistent
> L2ARC is implemented as after a node restart or a resource failover in a
> cluster L2ARC will be kept warm. Then the only thing which might affect L2
> performance considerably would be a L2ARC device failure...
> >
> >
> > --
> > Robert Milkowski
> > http://milek.blogspot.com
> >

Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Michael Sullivan
Everyone,

Thanks for the help.  I really appreciate it.

Well, I actually walked through the source code with an associate today and we 
found out how things work by looking at the code.

It appears that L2ARC is just assigned in round-robin fashion.  If a device 
goes offline, then it goes to the next and marks that one as offline.  The 
failure to retrieve the requested object is treated like a cache miss and 
everything goes along its merry way, as far as we can tell.

I would have hoped it to be different in some way.  Like if the L2ARC were
striped for performance reasons, that would be really cool, using that
device as an extension of the VM model it is modeled after.  Which would mean
using the L2ARC as an extension of the virtual address space and striping it to
make it more efficient.  Way cool.  If it took out the bad device and
reconfigured the stripe, that would be even cooler.  Replacing it with a hot
spare would be cooler still.  However, it appears from the source code that
the L2ARC is just a (sort of) jumbled collection of ZFS objects.  Yes, it gives
you better performance if you have it, but it doesn't really use it in a way
you might expect something as cool as ZFS does.

I understand why it is read-only, and why it invalidates its cached copy when a
write occurs; that is to be expected for any object that is written.

If an object is not there because of a failure or because it has been removed 
from the cache, it is treated as a cache miss, all well and good - go fetch 
from the pool.

I also understand why the ZIL is important and that it should be mirrored if it
is to be on a separate device.  Though I'm wondering how it is handled
internally when there is a failure of one of its default devices; but then
again, it's on a regular pool and should be redundant enough, with only some
degradation in speed.

Breaking these devices out from their default locations is great for 
performance, and I understand.  I just wish the knowledge of how they work and 
their internal mechanisms were not so much of a black box.  Maybe that is due 
to the speed at which ZFS is progressing and the features it adds with each 
subsequent release.

Overall, I am very impressed with ZFS, its flexibility and, even more so, the
way it breaks all the rules about how storage should be managed, and I really
like it.  I have yet to see anything come close to its approach to disk data
management.  Let's just hope it keeps moving forward; it is truly a unique way
to view disk storage.

Anyway, sorry for the ramble, but to everyone, thanks again for the answers.

Mike

---
Michael Sullivan   
michael.p.sulli...@me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 7 May 2010, at 00:00 , Robert Milkowski wrote:

> On 06/05/2010 15:31, Tomas Ögren wrote:
>> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
>> 
>>   
>>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
>>> 
 In the L2ARC (cache) there is no ability to mirror, because cache device
 removal has always been supported.  You can't mirror a cache device, 
 because
 you don't need it.
   
>>> How do you know that I don't need it?  The ability seems useful to me.
>>> 
>> The gain is quite minimal.. If the first device fails (which doesn't
>> happen too often I hope), then it will be read from the normal pool once
>> and then stored in ARC/L2ARC again. It just behaves like a cache miss
>> for that specific block... If this happens often enough to become a
>> performance problem, then you should throw away that L2ARC device
>> because it's broken beyond usability.
>> 
>>   
> 
> Well if a L2ARC device fails there might be an unacceptable drop in delivered 
> performance.
> If it were mirrored than a drop usually would be much smaller or there could 
> be no drop if a mirror had an option to read only from one side.
> 
> Being able to mirror L2ARC might especially be useful once a persistent L2ARC 
> is implemented as after a node restart or a resource failover in a cluster 
> L2ARC will be kept warm. Then the only thing which might affect L2 
> performance considerably would be a L2ARC device failure...
> 
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Brandon High
On Wed, May 5, 2010 at 8:47 PM, Michael Sullivan
 wrote:
> While it explains how to implement these, there is no information regarding 
> failure of a device in a striped L2ARC set of SSD's.  I have been hard 
> pressed to find this information anywhere, short of testing it myself, but I 
> don't have the necessary hardware in a lab to test correctly.  If someone has 
> pointers to references, could you please provide them to chapter and verse, 
> rather than the advice to "Go read the manual."

Yes, but the answer is in the man page. So reading it is a good idea:

"If a read error is encountered on a cache device, that read I/O is
reissued to the original storage pool  device,  which  might be part
of a mirrored or raidz configuration."

> I'm running 2009.11 which is the latest OpenSolaris.  I should have made that 
> clear, and that I don't intend this to be on Solaris 10 system, and am 
> waiting for the next production build anyway.  As you say, it does not exist 
> in 2009.06, this is not the latest production Opensolaris which is 2009.11, 
> and I'd be more interested in its behavior than an older release.

The "latest" is b134, which contains many, many fixes over 2009.11,
though it's a dev release.

> From the information I've been reading about the loss of a ZIL device, it 
> will be relocated to the storage pool it is assigned to.  I'm not sure which 
> version this is in, but it would be nice if someone could provide the release 
> number it is included in (and actually works), it would be nice.  Also, will 
> this functionality be included in the mythical 2010.03 release?

It went in somewhere around b118, I think, so it will be in the
next scheduled release.
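
Once you are on a build with that support, removing a separate log device is
just (the pool and device names here are hypothetical):

  # zpool remove tank c3t1d0

At this point zpool remove only handles log devices, cache devices and hot
spares, not ordinary top-level data vdevs.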

> Also, I'd be interested to know what features along these lines will be 
> available in 2010.03 if it ever sees the light of day.

Look at the latest dev release. b134 was originally slated to be
2010.03, so the feature set of the final release should be very close.

> So what you are saying is that if a single device fails in a striped L2ARC 
> VDEV, then the entire VDEV is taken offline and the fallback is to simply use 
> the regular ARC and fetch from the pool whenever there is a cache miss.

The strict interpretation of the documentation is that the read is
re-issued. My understanding is that the block that failed to be read
would then be read from the original pool.

> Or, does what you are saying here mean that if I have a 4 SSD's in a stripe 
> for my L2ARC, and one device fails, the L2ARC will be reconfigured 
> dynamically using the remaining SSD's for L2ARC.

Auto-healing in zfs would resilver the block that failed to be read,
either onto the same device or another cache device in the pool,
exactly as if a read failed on a normal pool device. It wouldn't
reconfigure the cache devices, but each failed read would cause the
blocks to be reallocated to a functioning device which has the same
effect in the end.

-B

-- 
Brandon High : bh...@freaks.com


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Robert Milkowski

On 06/05/2010 15:31, Tomas Ögren wrote:
> On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:
>> On Wed, 5 May 2010, Edward Ned Harvey wrote:
>>> In the L2ARC (cache) there is no ability to mirror, because cache device
>>> removal has always been supported.  You can't mirror a cache device, because
>>> you don't need it.
>>
>> How do you know that I don't need it?  The ability seems useful to me.
>
> The gain is quite minimal.. If the first device fails (which doesn't
> happen too often I hope), then it will be read from the normal pool once
> and then stored in ARC/L2ARC again. It just behaves like a cache miss
> for that specific block... If this happens often enough to become a
> performance problem, then you should throw away that L2ARC device
> because it's broken beyond usability.


Well, if an L2ARC device fails there might be an unacceptable drop in
delivered performance.
If it were mirrored then the drop would usually be much smaller, or there
could be no drop at all if a mirror had an option to read from only one side.

Being able to mirror the L2ARC might especially be useful once a persistent
L2ARC is implemented, as after a node restart or a resource failover in a
cluster the L2ARC will be kept warm. Then the only thing which might affect
L2 performance considerably would be an L2ARC device failure...



--
Robert Milkowski
http://milek.blogspot.com



[zfs-discuss] ZFS - USB 3.0 SSD disk

2010-05-06 Thread Bruno Sousa
Hi all,

It seems like the market has yet another type of SSD device, this time a
USB 3.0 portable SSD by OCZ.
Going by the specs, it seems to me that if this device has a good price
it might be quite useful for caching purposes on ZFS-based storage.
Take a look at
http://www.ocztechnology.com/products/solid-state-drives/usb-3-0-/ocz-enyo-usb-3-0-portable-solid-state-drive.html
and we need to wait for prices, and for systems with USB 3.0 :)

P.S. PCIe 3.0 is nearby... yippee!

Bruno





Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Tomas Ögren
On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:

> On Wed, 5 May 2010, Edward Ned Harvey wrote:
>>
>> In the L2ARC (cache) there is no ability to mirror, because cache device
>> removal has always been supported.  You can't mirror a cache device, because
>> you don't need it.
>
> How do you know that I don't need it?  The ability seems useful to me.

The gain is quite minimal.. If the first device fails (which doesn't
happen too often I hope), then it will be read from the normal pool once
and then stored in ARC/L2ARC again. It just behaves like a cache miss
for that specific block... If this happens often enough to become a
performance problem, then you should throw away that L2ARC device
because it's broken beyond usability.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Bob Friesenhahn

On Wed, 5 May 2010, Edward Ned Harvey wrote:


> In the L2ARC (cache) there is no ability to mirror, because cache device
> removal has always been supported.  You can't mirror a cache device, because
> you don't need it.


How do you know that I don't need it?  The ability seems useful to me.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Ross Walker
On May 6, 2010, at 8:34 AM, Edward Ned Harvey wrote:

>> From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
>>
>>> In neither case do you have data or filesystem corruption.
>>
>> ZFS probably is still OK, since it's designed to handle this (?),
>> but the data can't be OK if you lose 30 secs of writes.. 30 secs of
>> writes that have been ack'd being done to the servers/applications..
>
> What I meant was:  Yes there's data loss.  But no corruption.  In other
> filesystems, if you have an ungraceful shutdown while the filesystem is
> writing, since filesystems such as EXT3 perform file-based (or inode-based)
> block write operations, then you can have files whose contents have been
> corrupted...  Some sectors of the file still in their "old" state, and some
> sectors of the file in their "new" state.  Likewise, in something like EXT3,
> you could have some file fully written, while another one hasn't been
> written yet, but should have been.  (AKA, some files written out of order.)
>
> In the case of EXT3, since it is a journaled filesystem, the journal only
> keeps the *filesystem* consistent after a crash.  It's still possible to
> have corrupted data in the middle of a file.


I believe ext3 has an option to journal data as well as metadata, it  
just defaults to metadata.
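
On the Linux side that is just a mount option; a hedged example (the device
and mount point are made up):

  # mount -o data=journal /dev/sda2 /export/data

data=ordered is the usual ext3 default; data=journal pushes file data through
the journal as well, at a noticeable write-performance cost.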


I don't believe out-of-order writes are so much of an issue any more,
since Linux gained write barrier support (and most file systems and
block devices now support it).



> These things don't happen in ZFS.  ZFS takes journaling to a whole new
> level.  Instead of just keeping your filesystem consistent, it also keeps
> your data consistent.  Yes, data loss is possible when a system crashes, but
> the filesystem will never have any corruption.  These are separate things
> now, and never were before.


ZFS does NOT have a journal; it has an intent log, which is completely
different. A journal logs operations that are to be performed later
(the journal is read, then the operation is performed); an intent log logs
operations that are being performed now, and when the disk flushes, the
intent entry is marked complete.


ZFS is consistent by the nature of COW, which means a partial write
will not become part of the file system (the old block pointer isn't
updated until the new block completes the write).


> In ZFS, losing n-seconds of writes leading up to the crash will never result
> in files partially written, or written out of order.  Every atomic write to
> the filesystem results in a filesystem-consistent and data-consistent view
> of *some* valid form of all the filesystem and data within it.


The ZFS file system will always be consistent, but if an application
doesn't flush its data, then it can definitely end up with partially written
data.


-Ross
 


Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ragnar Sundblad
> 
> But if you have an application, protocol and/or user that demands
> or expects persistant storage, disabling ZIL of course could be fatal
> in case of a crash. Examples are mail servers and NFS servers.

Basically, anything which writes to disk based on requests from something
across a network.  Because if your system goes down and comes back up thinking
itself consistent, but there's one client thinking "A" and another client
thinking "B"... then even though your server is consistent, the world isn't.

Another great example would be if your server handles credit card
transactions.  If a user clicks "buy now" in  a web interface, and the
server contacts Visa or MasterCard, records the transaction, and then
crashes before it records the transaction to its own disks ... Then the
server would come up and have no recollection of that transaction.  But the
user, and Visa/Mastercard certainly would remember it.



Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Edward Ned Harvey
> From: Pasi Kärkkäinen [mailto:pa...@iki.fi]
>
> > In neither case do you have data or filesystem corruption.
> >
> 
> ZFS probably is still OK, since it's designed to handle this (?),
> but the data can't be OK if you lose 30 secs of writes.. 30 secs of
> writes
> that have been ack'd being done to the servers/applications..

What I meant was:  Yes there's data loss.  But no corruption.  In other
filesystems, if you have an ungraceful shutdown while the filesystem is
writing, since filesystems such as EXT3 perform file-based (or inode-based)
block write operations, then you can have files whose contents have been
corrupted...  Some sectors of the file still in their "old" state, and some
sectors of the file in their "new" state.  Likewise, in something like EXT3,
you could have some file fully written, while another one hasn't been
written yet, but should have been.  (AKA, some files written out of order.)

In the case of EXT3, since it is a journaled filesystem, the journal only
keeps the *filesystem* consistent after a crash.  It's still possible to
have corrupted data in the middle of a file.

These things don't happen in ZFS.  ZFS takes journaling to a whole new
level.  Instead of just keeping your filesystem consistent, it also keeps
your data consistent.  Yes, data loss is possible when a system crashes, but
the filesystem will never have any corruption.  These are separate things
now, and never were before.

In ZFS, losing n-seconds of writes leading up to the crash will never result
in files partially written, or written out of order.  Every atomic write to
the filesystem results in a filesystem-consistent and data-consistent view
of *some* valid form of all the filesystem and data within it.




Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Pawel Jakub Dawidek
On Thu, May 06, 2010 at 01:15:41PM +0100, Robert Milkowski wrote:
> On 06/05/2010 13:12, Robert Milkowski wrote:
> >On 06/05/2010 12:24, Pawel Jakub Dawidek wrote:
> >>I read that this property is not inherited and I can't see why.
> >>If what I read is up-to-date, could you tell why?
> >
> >It is inherited. Sorry for the confusion but there was a discussion if 
> >it should or should not be inherited, then we propose that it 
> >shouldn't but it was changed again during a PSARC review that it should.
> >
> >And I did a copy'n'paste here.
> >
> >Again, sorry for the confusion.
> >
> Well, actually I did copy'n'paste a proper page as it doesn't say 
> anything about inheritance.
> 
> Nevertheless, yes it is inherited.

Yes, your e-mail didn't mention that and I wanted to clarify if what I
read in PSARC changed or not. Thanks:)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Robert Milkowski

On 06/05/2010 13:12, Robert Milkowski wrote:
> On 06/05/2010 12:24, Pawel Jakub Dawidek wrote:
>> I read that this property is not inherited and I can't see why.
>> If what I read is up-to-date, could you tell why?
>
> It is inherited. Sorry for the confusion but there was a discussion if
> it should or should not be inherited, then we propose that it
> shouldn't but it was changed again during a PSARC review that it should.
>
> And I did a copy'n'paste here.
>
> Again, sorry for the confusion.

Well, actually I did copy'n'paste the proper page; it just doesn't say
anything about inheritance.


Nevertheless, yes it is inherited.

--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Robert Milkowski

On 06/05/2010 12:24, Pawel Jakub Dawidek wrote:
> I read that this property is not inherited and I can't see why.
> If what I read is up-to-date, could you tell why?


It is inherited. Sorry for the confusion, but there was a discussion about
whether it should or should not be inherited; we first proposed that it
shouldn't, but that was changed again during the PSARC review so that it should.


And I did a copy'n'paste here.

Again, sorry for the confusion.

--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] [indiana-discuss] image-update doesn't work anymore (bootfs not supported on EFI)

2010-05-06 Thread Christian Thalinger
On Wed, 2010-05-05 at 09:45 -0600, Evan Layton wrote:
> No that doesn't appear like an EFI label. So it appears that ZFS
> is seeing something there that it's interpreting as an EFI label.
> Then the command to set the bootfs property is failing due to that.
> 
> To restate the problem the BE can't be activated because we can't set
> the bootfs property of the root pool and even the ZFS command to set
> it fails with "property 'bootfs' not supported on EFI labeled devices"
> 
> for example the following command:
> # zfs set bootfs=rpool/ROOT/opensolaris rpool
> 
> fails with that same error message.

I guess you mean zpool, but yes:

# zpool set bootfs=rpool/ROOT/opensolaris-138 rpool
cannot set property for 'rpool': property 'bootfs' not supported on EFI labeled devices

> 
> Do you have any of the older BEs like build 134 that you can boot back
> to and see if those will allow you to set the bootfs property on the
> root pool? It's just really strange that out of nowhere it started
> thinking that the device is EFI labeled.

I have a couple of BEs I could boot to:

$ beadm list
BE  Active Mountpoint Space   Policy Created  
--  -- -- -   -- ---  
opensolaris -  -  1.00G   static 2009-10-01 08:00 
opensolaris-124 -  -  20.95M  static 2009-10-03 13:30 
opensolaris-125 -  -  30.00M  static 2009-10-17 15:18 
opensolaris-126 -  -  25.33M  static 2009-10-29 20:18 
opensolaris-127 -  -  1.37G   static 2009-11-14 13:20 
opensolaris-128 -  -  1.91G   static 2009-12-04 14:28 
opensolaris-129 -  -  22.49M  static 2009-12-12 11:31 
opensolaris-130 -  -  21.64M  static 2009-12-26 19:46 
opensolaris-131 -  -  24.72M  static 2010-01-22 22:51 
opensolaris-132 -  -  57.32M  static 2010-02-09 23:05 
opensolaris-133 -  -  1.07G   static 2010-02-20 12:55 
opensolaris-134 N  /  43.17G  static 2010-03-08 21:58 
opensolaris-138 R  -  1.81G   static 2010-05-04 12:03 

I will try on 132 or 133.  Get back to you later.
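
For reference, the test being suggested, run from inside the older BE, would
just be to retry the failing command (dataset name taken from the listing
above):

  # zpool set bootfs=rpool/ROOT/opensolaris-138 rpool
  # zpool get bootfs rpool

If that succeeds from the older build, it would suggest the EFI-label
complaint is specific to the newer bits.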

-- Christian



Re: [zfs-discuss] Another MPT issue - kernel crash

2010-05-06 Thread James C. McPherson

On  5/05/10 10:42 PM, Bruno Sousa wrote:
> Hi all,
>
> I have faced yet another kernel panic that seems to be related to mpt
> driver.
> This time i was trying to add a new disk to a running system (snv_134)
> and this new disk was not being detected...following a tip i ran the
> lsitool to reset the bus and this lead to a system panic.
>
> MPT driver : BAD TRAP: type=e (#pf Page fault) rp=ff001fc98020
> addr=4 occurred in module "mpt" due to a NULL pointer dereference
>
> If someone has a similar problem it might be worthwhile to expose it
> here or to add information to the filled bug, available at
> https://defect.opensolaris.org/bz/show_bug.cgi?id=15879



That's an already-known CR, tracked in Bugster. I've
updated defect.o.o and transferred your info to the
Bugster CR, 6895862. Until the nightly inside->outside
bugs.o.o sync runs it'll still show up as closed, but
don't worry, I've re-opened it.


James C. McPherson
--
Senior Software Engineer, Solaris
Oracle
http://www.jmcp.homeunix.com/blog


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Darren J Moffat

On 06/05/2010 12:24, Pawel Jakub Dawidek wrote:
> I read that this property is not inherited and I can't see why.
> If what I read is up-to-date, could you tell why?


It is inherited; this changed as a result of the PSARC review.

--
Darren J Moffat


Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Pawel Jakub Dawidek
On Thu, May 06, 2010 at 11:28:37AM +0100, Robert Milkowski wrote:
> With the put back of:
> 
> [PSARC/2010/108] zil synchronicity
> 
> zfs datasets now have a new 'sync' property to control synchronous 
> behaviour.
> The zil_disable tunable to turn synchronous requests into asynchronous
> requests (disable the ZIL) has been removed. For systems that use that 
> switch on upgrade
> you will now see a message on booting:
> 
>   sorry, variable 'zil_disable' is not defined in the 'zfs' module
> 
> Please update your system to use the new sync property.
> Here is a summary of the property:
> 
> ---
> 
> The options and semantics for the zfs sync property:
> 
> sync=standard
>This is the default option. Synchronous file system transactions
>(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
>and then secondly all devices written are flushed to ensure
>the data is stable (not cached by device controllers).
> 
> sync=always
>For the ultra-cautious, every file system transaction is
>written and flushed to stable storage by system call return.
>This obviously has a big performance penalty.
> 
> sync=disabled
>Synchronous requests are disabled.  File system transactions
>only commit to stable storage on the next DMU transaction group
>commit which can be many seconds.  This option gives the
>highest performance, with no risk of corrupting the pool.
>However, it is very dangerous as ZFS is ignoring the synchronous
> transaction
>demands of applications such as databases or NFS.
>Setting sync=disabled on the currently active root or /var
>file system may result in out-of-spec behavior or application data
>loss and increased vulnerability to replay attacks.
>Administrators should only use this when these risks are understood.
> 
> The property can be set when the dataset is created, or dynamically,
> and will take effect immediately.  To change the property, an
> administrator can use the standard 'zfs' command.  For example:
> 
> # zfs create -o sync=disabled whirlpool/milek
> # zfs set sync=always whirlpool/perrin

I read that this property is not inherited and I can't see why.
If what I read is up-to-date, could you tell why?

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgpnwVhYvicjy.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes

2010-05-06 Thread Matt Keenan


Based on the comments, some people say nay, some say yea, so I decided
to give it a spin and see how I get on.

To make my mirror bootable I followed instructions posted here :
  http://www.taiter.com/blog/2009/04/opensolaris-200811-adding-disk.html
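
(In essence, on x86 that comes down to putting GRUB on the newly attached
disk with installgrub; something like the following, where the device name
is only an example:

# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c5t0d0s0

See the linked post for the full details.)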

I plan to do a quick write-up of my own experience, but so far
everything is working fine.


The mirror size is 200GB (the smallest disk, which happens to be the
laptop disk). Once I attached the USB drive it started resilvering
straight away, and it took only 1hr 45min to resilver 120GB. I was very
impressed with that.

So far I've not noticed any system performance degradation with the
drive attached. I did a quick test and yanked out the drive; the rpool
degraded as expected, but the system continued to function fine.


I also did a quick test to see if the USB drive was indeed bootable by
connecting it to another laptop, and it booted perfectly.


Connecting the USB drive back to the original laptop, the pool comes back
online and resilvers seamlessly.


This is automatic 24/7 backup at its best...

One thing I did notice: I powered down yesterday whilst the USB drive was
attached, and this morning I booted without it; the laptop failed to boot
until I reconnected the USB drive, after which it booted up fine.


The key would be to degrade the pool before shutdown, i.e. before
disconnecting the USB drive; I might try using zpool offline and see how
that works, as sketched below.
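
Assuming the USB disk shows up as c5t0d0s0 (substitute the real device
name), that would be roughly:

# zpool offline rpool c5t0d0s0
  ... disconnect the drive ...
# zpool online rpool c5t0d0s0

once it is plugged back in again.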


If I encounter issues, I'll post again.

cheers

Matt

On 05/ 5/10 09:34 PM, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Matt Keenan

Just wondering whether mirroring a USB drive with main laptop disk for
backup purposes is recommended or not.

Plan would be to connect the USB drive, once or twice a week, let it
resilver, and then disconnect again. Connecting USB drive 24/7 would
AFAIK have performance issues for the Laptop.

MMmmm...  If it works, sounds good.  But I don't think it'll work as
expected, for a number of reasons, outlined below.

The suggestion I would have instead would be to make the external drive its
own separate zpool, and then you can incrementally "zfs send | zfs receive"
onto the external.
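
For example (pool and snapshot names are made up here, and the external
pool is called extpool):

# zfs snapshot -r rpool@backup-1
# zfs send -R rpool@backup-1 | zfs receive -Fdu extpool
  ... a week later ...
# zfs snapshot -r rpool@backup-2
# zfs send -R -i rpool@backup-1 rpool@backup-2 | zfs receive -Fdu extpool

Only the blocks that changed between the two snapshots cross the USB cable
on the second run.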

Here are the obstacles I think you'll have with your proposed solution:

#1 I think the entire used portion of the filesystem needs to resilver
every time.  I don't think there's any such thing as an incremental
resilver.

#2 How would you plan to disconnect the drive?  If you zpool detach it, I
think it's no longer a mirror, and not mountable.  If you simply yank out
the plug ... although that might work, it would certainly be nonideal.  If
you power off, disconnect, and power on ... Again, it should probably be
fine, but it's not designed to be used that way intentionally, so your
results ... are probably as-yet untested.

I don't want to go on.  This list could go on forever.  I will strongly
encourage you to simply use "zfs send | zfs receive", because that is
standard practice.  It is known that the external drive is not
bootable this way, but that's why you have this article on how to make it
bootable:

http://docs.sun.com/app/docs/doc/819-5461/ghzur?l=en&a=view



This would have the added benefit of the USB drive being bootable.

By default, AFAIK, that's not correct.  When you mirror rpool to another
device, by default the 2nd device is not bootable, because it's just got an
rpool in there.  No boot loader.

Even if you do this mirror idea, which I believe will be slower and less
reliable than "zfs send | zfs receive", you still haven't gained anything
compared to the "zfs send | zfs receive" procedure, which is known to work
reliably with optimal performance.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does Opensolaris support thin reclamation?

2010-05-06 Thread Andras Spitzer
Please find this thread for further info about this topic : 

http://www.opensolaris.org/jive/thread.jspa?threadID=120824&start=0&tstart=0

In short, ZFS doesn't support thin reclamation today, although we have an
RFE open to implement it at some point in the future.

Regards,
sendai
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Robert Milkowski

With the put back of:

[PSARC/2010/108] zil synchronicity

zfs datasets now have a new 'sync' property to control synchronous behaviour.
The zil_disable tunable to turn synchronous requests into asynchronous
requests (disable the ZIL) has been removed. For systems that use that switch 
on upgrade
you will now see a message on booting:

  sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property.
Here is a summary of the property:

---

The options and semantics for the zfs sync property:

sync=standard
   This is the default option. Synchronous file system transactions
   (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
   and then secondly all devices written are flushed to ensure
   the data is stable (not cached by device controllers).

sync=always
   For the ultra-cautious, every file system transaction is
   written and flushed to stable storage by system call return.
   This obviously has a big performance penalty.

sync=disabled
   Synchronous requests are disabled.  File system transactions
   only commit to stable storage on the next DMU transaction group
   commit which can be many seconds.  This option gives the
   highest performance, with no risk of corrupting the pool.
   However, it is very dangerous as ZFS is ignoring the synchronous
transaction
   demands of applications such as databases or NFS.
   Setting sync=disabled on the currently active root or /var
   file system may result in out-of-spec behavior or application data
   loss and increased vulnerability to replay attacks.
   Administrators should only use this when these risks are understood.

The property can be set when the dataset is created, or dynamically,
and will take effect immediately.  To change the property, an
administrator can use the standard 'zfs' command.  For example:

# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin



-- Team ZIL.

It should be in build 140.
For a little bit more information on it you might look at 
http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html


--
Robert Milkowski
http://milek.blogspot.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot create snapshot: dataset is busy

2010-05-06 Thread Brandon High
On Thu, May 6, 2010 at 1:31 AM, Brandon High  wrote:
> Any other way to fix it? There's no data in the zvol that I can't
> easily reproduce if it needs to be destroyed.

I did a rollback to the most recent snapshot, which seems to have worked.
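
(For the archives, that was just something along the lines of

# zfs rollback tank/myvol@latest-snap

with the real zvol and snapshot names, of course.)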

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] cannot create snapshot: dataset is busy

2010-05-06 Thread Brandon High
I'm unable to snapshot a dataset, receiving the error "dataset is
busy". Google and some bug reports suggest it's from a zil that hasn't
been completely replayed, and that mounting and unmounting the dataset
will fix it. Which is great, except it's a zvol.

Any other way to fix it? There's no data in the zvol that I can't
easily reproduce if it needs to be destroyed.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Ragnar Sundblad

On 6 May 2010, at 08.17, Pasi Kärkkäinen wrote:

> On Wed, May 05, 2010 at 11:32:23PM -0400, Edward Ned Harvey wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Robert Milkowski
>>> 
>>> if you can disable ZIL and compare the performance to when it is off it
>>> will give you an estimate of what's the absolute maximum performance
>>> increase (if any) by having a dedicated ZIL device.
>> 
>> I'll second this suggestion.  It'll cost you nothing to disable the ZIL
>> temporarily.  (You have to dismount the filesystem twice.  Once to disable
>> the ZIL, and once to re-enable it.)  Then you can see if performance is
>> good.  If performance is good, then you'll know you need to accelerate your
>> ZIL.  (Because disabled ZIL is the fastest thing you could possibly ever
>> do.)
>> 
>> Generally speaking, you should not disable your ZIL for the long run.  But
>> in some cases, it makes sense.
>> 
>> Here's how you determine if you want to disable your ZIL permanently:
>> 
>> First, understand that with the ZIL disabled, all sync writes are treated as
>> async writes.  This is buffered in ram before being written to disk, so the
>> kernel can optimize and aggregate the write operations into one big chunk.
>> 
>> No matter what, if you have an ungraceful system shutdown, you will lose all
>> the async writes that were waiting in ram.
>> 
>> If you have ZIL disabled, you will also lose the sync writes that were
>> waiting in ram (because those are being handled as async.)
>> 
>> In neither case do you have data or filesystem corruption.
>> 
> 
> ZFS probably is still OK, since it's designed to handle this (?),
> but the data can't be OK if you lose 30 secs of writes.. 30 secs of writes
> that have been ack'd being done to the servers/applications..

Entirely right!

This is the case for many local user writes anyway, since many
applications don't sync the written data to disk.

But if you have an application, protocol and/or user that demands
or expects persistent storage, disabling the ZIL could of course be fatal
in case of a crash. Examples are mail servers and NFS servers.
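
(On builds that already have the new per-dataset sync property discussed
elsewhere in this thread, such a test can at least be confined to a
scratch dataset instead of the whole machine, e.g.:

# zfs set sync=disabled tank/test
  ... run the benchmark ...
# zfs inherit sync tank/test

where tank/test is just an example name.)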

/ragge

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why both dedup and compression?

2010-05-06 Thread Peter Tribble
On Thu, May 6, 2010 at 2:06 AM, Richard Jahnel  wrote:
> I've googled this for a bit, but can't seem to find the answer.
>
> What does compression bring to the party that dedupe doesn't cover already?

Compression will reduce the storage requirements for non-duplicate data.

As an example, I have a system to which I rsync the web application data
from a whole bunch of servers (zones). There's a fair amount of duplication
in the application files (java, tomcat, apache, and the like), so dedup is
a big win. On the other hand, there's essentially no duplication whatsoever
in the log files, which are pretty big but compress really well. So having
both enabled works really well.
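
For example, something along these lines (dataset names are illustrative):

# zfs set dedup=on backup/apps
# zfs set compression=on backup/apps
# zfs set compression=gzip backup/logs

and "zfs get compressratio backup/logs" will show how well the logs end up
compressing.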

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss