Re: [zfs-discuss] Looking for some hardware answers, maybe someone on this list could help

2008-10-16 Thread mike
On Wed, Oct 15, 2008 at 9:13 PM, Al Hopper [EMAIL PROTECTED] wrote:

 The exception to the rule of multiple 12v output sections is PC
 Power & Cooling - who claim that there is no technical advantage to
 having multiple 12v outputs (and this feature is only a marketing
 gimmick).  But now that they have merged with OCZ - who always claimed
 that there are advantages to multiple 12v output sections ... I'm not
 sure where they stand today.  In any case the PC Power & Cooling PSUs
 are premium, reliable, high performance parts in my personal
 experience - although their Silencer products are far from silent in
 my experience!  :)

It's good to have that vote of confidence, as I picked that brand :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Looking for some hardware answers, maybe someone on this list could help

2008-10-16 Thread gm_sjo
 On Wed, Oct 15, 2008 at 9:13 PM, Al Hopper [EMAIL PROTECTED] wrote:
 The exception to the rule of multiple 12v output sections is PC
 Power & Cooling - who claim that there is no technical advantage to
 having multiple 12v outputs (and this feature is only a marketing
 gimmick).  But now that they have merged with OCZ - who always claimed
 that there are advantages to multiple 12v output sections ... I'm not
 sure where they stand today.  In any case the PC Power & Cooling PSUs
 are premium, reliable, high performance parts in my personal
 experience - although their Silencer products are far from silent in
 my experience!  :)

Well, that depends - you can build a power supply with multiple isolated
12V rails. I would hope this is what they mean when they specify
multiple 12V outputs with equal/different current/load ratings.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Ross
Well obviously recovery scenarios need testing, but I still don't see it being 
that bad.  My thinking on this is:

1.  Loss of a server is very much the worst case scenario.  Disk errors are 
much more likely, and with raid-z2 pools on the individual servers this should 
not pose a problem.  I also would not expect to see disk failures downing an 
entire x4500.  Sun have sold an awful lot of these now, enough for me to feel 
any such problems should be a thing of the past.

2.  Even when a server does fail, the nature of ZFS is such that you would not 
expect to lose your data, nor should you be expecting to resilver the entire 
28TB.  A motherboard / backplane / PSU failure will offline that server, but 
once the faulted components are replaced your pool will come back online.  Once 
the pool is online, ZFS has the ability to resilver just the changed data, 
meaning that your rebuild time will be simply proportional to the time the 
server was down.
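
(As a sketch, with a hypothetical pool and device, a planned outage and the
incremental resilver would look something like:

  zpool offline tank c2t0d0    # before taking the server down
  ...maintenance...
  zpool online tank c2t0d0     # resilvers only what changed while offline
  zpool status tank            # watch the (short) resilver complete
)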

Of course these failure modes would need testing, as would rebuild times.  I 
don't see 'zfs send' performance being an issue though, not unless Gray has 
another 150TB of storage lying around that he's not telling us about.  :-)

There are always going to be some tradeoffs between risk, capacity and price, 
but I expect that the benefits of this setup far outweigh the negatives.

Ross
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Gray Carper
Howdy!

Very valuable advice here (and from Bob, who made similar comments - thanks,
Bob!). I think, then, we'll generally stick to 128K recordsizes. In the case
of databases, we'll stray as appropriate, and we may also stray with the HPC
compute cluster if we can demonstrate that it is worth it.
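
(For example - the dataset name and the 8K DB block size below are
hypothetical:

  zfs set recordsize=8k tank/db
  zfs get recordsize tank/db

Only files written after the change pick up the new record size.)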

To answer your questions below...

Currently, we have a single pool, in a load share configuration (no
raidz), that collects all the storage (which answers Ross' question too).
From that we carve filesystems on demand. There are many more tests planned
for that construction, though, so we are not married to it.

Redundancy abounds. ;) Since the pool doesn't employ raidz, it isn't
internally redundant, but we plan to replicate the pool's data to an
identical system (which is not yet built) at another site. Our initial
userbase doesn't need the replication, however, because they use the system
for little more than scratch space. Huge genomic datasets are dumped on the
storage, analyzed, and the results (which are much smaller) get sent
elsewhere. Everything is wiped out soon after that and the process starts
again. Future projected uses of the storage, however, would be far less
tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

I see that Archie and Miles have shared some harrowing concerns which we
take very seriously. I don't think I'll be able to reply to them today, but
I certainly will in the near future (particularly once we've completed some
more of our induced failure scenarios).

Sidenote: Today we made eight network/iSCSI related tweaks that, in
aggregate, have resulted in dramatic performance improvements (some I just
hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

- disabling the Nagle algorithm on the head node
- setting each iSCSI target block size to match the ZFS record size of 128K
- disabling thin provisioning on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16
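
(For reference, a sketch of how such tunables are typically applied on
Solaris - the exact syntax below is a reconstruction, not the literal
commands used; the /etc/system entries need a reboot, the ndd settings
don't persist across one:

  # /etc/system entries (reboot required)
  set ddi_msix_alloc_limit=8
  set ip:ip_soft_rings_cnt=16

  # runtime TCP tuning via ndd (not persistent)
  ndd -set /dev/tcp tcp_naglim_def 1          # disable Nagle
  ndd -set /dev/tcp tcp_deferred_acks_max 16
  ndd -set /dev/tcp tcp_local_dacks_max 16
)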

Rerunning the same tests, we now see...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 143373
Rewrite: 183170
Read: 433205
Reread: 435503
Random Read: 90118
Random Write: 19488

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f
/volumes/data-iscsi/perftest/8gbtest
Write:  463260
Rewrite:  449280
Read:  1092291
Reread:  881044
Random Read:  442565
Random Write:  565565

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 357199
Rewrite: 342788
Read: 609553
Reread: 645618
Random Read: 218874
Random Write: 339624

Thanks so much to everyone for all their great contributions!
-Gray

On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai 
[EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi Gray,

 You've got a nice setup going there; a few comments:

 1. Do not tune ZFS without a proven test-case to show otherwise, except...
 2. For databases. Tune recordsize for that particular FS to match DB
 recordsize.

 Few questions...

 * How are you divvying up the space ?
 * How are you taking care of redundancy ?
 * Are you aware that each layer of ZFS needs its own redundancy ?

 Since you have got a mixed-use case here, I would be surprised if a general
 config would cover all, though it might do with some luck.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Ross
Miles makes a good point here, you really need to look at how this copes with 
various failure modes.

Based on my experience, iSCSI is something that may cause you problems.  When I 
tested this kind of setup last year I found that the entire pool hung for 3 
minutes any time an iSCSI volume went offline.  It looked like a relatively 
simple thing to fix if you can recompile the iSCSI driver, and there is talk 
about making the timeout adjustable, but for me that was enough to put our 
project on hold for now.

Ross
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Gray Carper
Oops - one thing I meant to mention: We only plan to cross-site replicate
data for those folks who require it. The HPC data crunching would have no
use for it, so that filesystem wouldn't be replicated. In reality, we only
expect a select few users, with relatively small filesystems, to actually
need replication. (Which begs the question: Why build an identical 150TB
system to support that? Good question. I think we'll reevaluate. ;)

-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper [EMAIL PROTECTED] wrote:

 Howdy!

 Very valuable advice here (and from Bob, who made similar comments -
 thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In
 the case of databases, we'll stray as appropriate, and we may also stray
 with the HPC compute cluster if we can demonstrate that it is worth it.

 To answer your questions below...

 Currently, we have a single pool, in a load share configuration (no
 raidz), that collects all the storage (which answers Ross' question too).
 From that we carve filesystems on demand. There are many more tests planned
 for that construction, though, so we are not married to it.

 Redundancy abounds. ;) Since the pool doesn't employ raidz, it isn't
 internally redundant, but we plan to replicate the pool's data to an
 identical system (which is not yet built) at another site. Our initial
 userbase doesn't need the replication, however, because they use the system
 for little more than scratch space. Huge genomic datasets are dumped on the
 storage, analyzed, and the results (which are much smaller) get sent
 elsewhere. Everything is wiped out soon after that and the process starts
 again. Future projected uses of the storage, however, would be far less
 tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

 I see that Archie and Miles have shared some harrowing concerns which we
 take very seriously. I don't think I'll be able to reply to them today, but
 I certainly will in the near future (particularly once we've completed some
 more of our induced failure scenarios).

 Sidenote: Today we made eight network/iSCSI related tweaks that, in
 aggregate, have resulted in dramatic performance improvements (some I just
 hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

 - disabling the Nagle algorithm on the head node
 - setting each iSCSI target block size to match the ZFS record size of 128K
 - disabling thin provisioning on the iSCSI targets
 - enabling jumbo frames everywhere (each switch and NIC)
 - raising ddi_msix_alloc_limit to 8
 - raising ip_soft_rings_cnt to 16
 - raising tcp_deferred_acks_max to 16
 - raising tcp_local_dacks_max to 16

 Rerunning the same tests, we now see...

 [1GB file size, 1KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
 Write: 143373
 Rewrite: 183170
 Read: 433205
 Reread: 435503
 Random Read: 90118
 Random Write: 19488

 [8GB file size, 512KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f
 /volumes/data-iscsi/perftest/8gbtest
 Write:  463260
 Rewrite:  449280
 Read:  1092291
 Reread:  881044
 Random Read:  442565
 Random Write:  565565

 [64GB file size, 1MB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
 Write: 357199
 Rewrite: 342788
 Read: 609553
 Reread: 645618
 Random Read: 218874
 Random Write: 339624

 Thanks so much to everyone for all their great contributions!
 -Gray


 On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai 
 [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi Gray,

 You've got a nice setup going there; a few comments:

 1. Do not tune ZFS without a proven test-case to show otherwise, except...
 2. For databases. Tune recordsize for that particular FS to match DB
 recordsize.

 Few questions...

 * How are you divvying up the space ?
 * How are you taking care of redundancy ?
 * Are you aware that each layer of ZFS needs its own redundancy ?

 Since you have got a mixed-use case here, I would be surprised if a
 general config would cover all, though it might do with some luck.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




 --
 Gray Carper
 MSIS Technical Services
 University of Michigan Medical School
 [EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
 http://www.umms.med.umich.edu/msis/




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-16 Thread Tomas Ögren
On 15 October, 2008 - Richard Elling sent me these 4,3K bytes:

 Tomas Ögren wrote:
  Hello.
 
  Executive summary: I want arc_data_limit (like arc_meta_limit, but for
  data) and set it to 0.5G or so. Is there any way to simulate it?

 
 We describe how to limit the size of the ARC cache in the Evil Tuning Guide.
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

Will that limit the _data_ portion only, or the metadata as well?
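
(The guide's knob, for reference, is a whole-ARC cap in /etc/system - the
value below is a hypothetical 512M:

  set zfs:zfs_arc_max=0x20000000

hence my question about what it actually limits.)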

/Tomas
-- 
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-16 Thread Darren J Moffat
Tomas Ögren wrote:
 On 15 October, 2008 - Richard Elling sent me these 4,3K bytes:
 
 Tomas Ögren wrote:
 Hello.

 Executive summary: I want arc_data_limit (like arc_meta_limit, but for
 data) and set it to 0.5G or so. Is there any way to simulate it?
   
 We describe how to limit the size of the ARC cache in the Evil Tuning Guide.
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
 
 Will that limit the _data_ portion only, or the metadata as well?

Recent builds of OpenSolaris have the ability to control on a per 
dataset basis what is put into the ARC and L2ARC using the
primarycache and secondarycache dataset properties:

  primarycache=all | none | metadata

  Controls what is cached in the primary cache  (ARC).  If
  this  property  is set to all, then both user data and
  metadata is cached. If this property is set  to  none,
  then  neither  user data nor metadata is cached. If this
  property is set to metadata,  then  only  metadata  is
  cached. The default value is all.

  secondarycache=all | none | metadata

  Controls what is cached in the secondary cache  (L2ARC).
  If  this  property  is set to all, then both user data
  and metadata is cached.  If  this  property  is  set  to
  none,  then  neither user data nor metadata is cached.
  If this property is set to metadata, then  only  meta-
  data is cached. The default value is all.
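
A usage sketch (the dataset name is hypothetical):

  zfs set primarycache=metadata tank/fs
  zfs set secondarycache=all tank/fs
  zfs get primarycache,secondarycache tank/fs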



-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Lost Disk Space

2008-10-16 Thread Ben Rockwood
I've been struggling to fully understand why disk space seems to vanish.  I've 
dug through bits of code and reviewed all the mails on the subject that I can 
find, but I still don't have a proper understanding of what's going on.  

I did a test with a local zpool on snv_97... zfs list, zpool list, and zdb all 
seem to disagree on how much space is available.  In this case it's only a 
discrepancy of about 20G or so, but I've got Thumpers that have a discrepancy 
of over 6TB!

Can someone give a really detailed explanation about what's going on?

block traversal size 670225837056 != alloc 720394438144 (leaked 50168601088)

bp count:15182232
bp logical:672332631040  avg:  44284
bp physical:   669020836352  avg:  44066compression:   1.00
bp allocated:  670225837056  avg:  44145compression:   1.00
SPA allocated: 720394438144 used: 96.40%

Blocks  LSIZE   PSIZE   ASIZE avgcomp   %Total  Type
12   120K   26.5K   79.5K   6.62K4.53 0.00  deferred free
 1512 512   1.50K   1.50K1.00 0.00  object directory
 3  1.50K   1.50K   4.50K   1.50K1.00 0.00  object array
 116K   1.50K   4.50K   4.50K   10.67 0.00  packed nvlist
 -  -   -   -   -   --  packed nvlist size
72  8.45M889K   2.60M   37.0K9.74 0.00  bplist
 -  -   -   -   -   --  bplist header
 -  -   -   -   -   --  SPA space map header
   974  4.48M   2.65M   7.94M   8.34K1.70 0.00  SPA space map
 -  -   -   -   -   --  ZIL intent log
 96.7K  1.51G389M777M   8.04K3.98 0.12  DMU dnode
17  17.0K   8.50K   17.5K   1.03K2.00 0.00  DMU objset
 -  -   -   -   -   --  DSL directory
13  6.50K   6.50K   19.5K   1.50K1.00 0.00  DSL directory child map
12  6.00K   6.00K   18.0K   1.50K1.00 0.00  DSL dataset snap map
14  38.0K   10.0K   30.0K   2.14K3.80 0.00  DSL props
 -  -   -   -   -   --  DSL dataset
 -  -   -   -   -   --  ZFS znode
 2 1K  1K  2K  1K1.00 0.00  ZFS V0 ACL
 5.81M   558G557G557G   95.8K1.0089.27  ZFS plain file
  382K   301M200M401M   1.05K1.50 0.06  ZFS directory
 9  4.50K   4.50K   9.00K  1K1.00 0.00  ZFS master node
12   482K   20.0K   40.0K   3.33K   24.10 0.00  ZFS delete queue
 8.20M  66.1G   65.4G   65.8G   8.03K1.0110.54  zvol object
 1512 512  1K  1K1.00 0.00  zvol prop
 -  -   -   -   -   --  other uint8[]
 -  -   -   -   -   --  other uint64[]
 -  -   -   -   -   --  other ZAP
 -  -   -   -   -   --  persistent error log
 1   128K   10.5K   31.5K   31.5K   12.19 0.00  SPA history
 -  -   -   -   -   --  SPA history offsets
 -  -   -   -   -   --  Pool properties
 -  -   -   -   -   --  DSL permissions
 -  -   -   -   -   --  ZFS ACL
 -  -   -   -   -   --  ZFS SYSACL
 -  -   -   -   -   --  FUID table
 -  -   -   -   -   --  FUID table size
 5  3.00K   2.50K   7.50K   1.50K1.20 0.00  DSL dataset next clones
 -  -   -   -   -   --  scrub work queue
 14.5M   626G623G624G   43.1K1.00   100.00  Total


real21m16.862s
user0m36.984s
sys 0m5.757s

===
Looking at the data:
[EMAIL PROTECTED] ~$ zfs list backup && zpool list backup
NAME USED  AVAIL  REFER  MOUNTPOINT
backup   685G   237K27K  /backup
NAME SIZE   USED  AVAILCAP  HEALTH  ALTROOT
backup   696G   671G  25.1G96%  ONLINE  -

So zdb says 626GB is used, zfs list says 685GB is used, and zpool list says 
671GB is used.  The pool was filled to 100% capacity via dd - this is confirmed, 
as I can't write data - yet zpool list says it's only 96%. 

benr.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Strange result when syncing between SPARC and x86

2008-10-16 Thread Casper . Dik

Hello


Today I've suddenly noticed that symlinks (at least) are corrupted when
syncing ZFS from SPARC to x86 (zfs send | ssh | zfs recv).

Example is:

[EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
lrwxrwxrwx   1 root root  15 Oct 13 14:35
/data/zones/testfs/root/etc/services -> ./inet/services

[EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
lrwxrwxrwx   1 root root  15 Oct 13 14:35
/data/zones/testfs/root/etc/services -> s/teni/.ervices


Firstly I thought it was because the original FS on SPARC is compressed... so
I just synced it locally on the same machine and all was OK, just a
different FS size since the destination was not compressed.

Then I synced that copy again to x86 but the result was the same - symlinks
were corrupted... so it's not compression.

SPARC is snv_85 and x86 is snv_82; I haven't had a chance yet to test on the
latest OpenSolaris.


Any suggestions?

Looks like the first 8 bytes aren't being byte-swapped back to native order:
"./inet/s" reversed is exactly "s/teni/.", while the trailing "ervices" is
untouched.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Strange result when syncing between SPARC and x86

2008-10-16 Thread Mike Futerko
Hi


Just checked with snv_99 on x86 (VMware install) - same result :(



Regards
Mike


[EMAIL PROTECTED] wrote:
 Hello


 Today I've suddenly noticed that symlinks (at least) are corrupted when
 syncing ZFS from SPARC to x86 (zfs send | ssh | zfs recv).

 Example is:

 [EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
 lrwxrwxrwx   1 root root  15 Oct 13 14:35
 /data/zones/testfs/root/etc/services -> ./inet/services

 [EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
 lrwxrwxrwx   1 root root  15 Oct 13 14:35
 /data/zones/testfs/root/etc/services -> s/teni/.ervices


 Firstly I thought it was because the original FS on SPARC is compressed... so
 I just synced it locally on the same machine and all was OK, just a
 different FS size since the destination was not compressed.

 Then I synced that copy again to x86 but the result was the same - symlinks
 were corrupted... so it's not compression.

 SPARC is snv_85 and x86 is snv_82; I haven't had a chance yet to test on the
 latest OpenSolaris.


 Any suggestions?
 
 Looks like the first 8 bytes aren't reversed.
 
 Casper
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Improving zfs send performance

2008-10-16 Thread Scott Williamson
On Wed, Oct 15, 2008 at 9:37 PM, Brent Jones [EMAIL PROTECTED] wrote:


 Scott,

 Can you tell us the configuration that you're using that is working for
 you?
 Were you using RaidZ, or RaidZ2? I'm wondering what the sweetspot is
 to get a good compromise in vdevs and usable space/performance


I used RaidZ with 4x5 disk and 4x6 disk vdevs in one pool with two hot
spares. This is very similar to how the pre-installed OS shipped from Sun.
Also note that I am using ssh as the transfer method.
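
(As a sketch, with hypothetical device names and only the first vdev of each
size spelled out, that layout would be created along these lines:

  zpool create tank \
    raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    ... three more 5-disk and three more 6-disk raidz vdevs ... \
    spare c7t0d0 c7t1d0
)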

I have not tried mbuffer with this configuration as in testing with initial
home directories of ~14GB in size it was not needed.

This configuration seems to be similar to Carsten Aulbert's evaluation,
without mbuffer in the pipe.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross
Ok, I'm not entirely sure this is the same problem, but it does sound fairly 
similar.  Apologies for hijacking the thread if this does turn out to be 
something else.

After following the advice here to get mbuffer working with zfs send / receive, 
I found I was only getting around 10MB/s throughput.  Thinking it was a network 
problem I started the below thread in the OpenSolaris help forum:
http://www.opensolaris.org/jive/thread.jspa?messageID=294846

Now though I don't think it's network at all.  The end result from that thread 
is that we can't see any errors in the network setup, and using nicstat and NFS 
I can show that the server is capable of 50-60MB/s over the gigabit link.  
Nicstat also shows clearly that both zfs send / receive and mbuffer are only 
sending 1/5 of that amount of data over the network.
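
(For anyone repeating the measurement, the nicstat run is along these
lines - the interface name is a placeholder:

  nicstat -i e1000g0 5

which prints per-interface read/write KB/s every five seconds.)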

I've completely run out of ideas of my own (but I do half expect there's a 
simple explanation I haven't thought of).  Can anybody think of a reason why 
both zfs send / receive and mbuffer would be so slow?
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith

 Try to separate the two things:
 (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 That should give you wirespeed
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
 
 (2) Try zfs send | mbuffer > /dev/null
 That should give you an idea how fast zfs send really is locally.
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rc-pool  732G  1.55T171 85  21.3M  1.01M
  mirror 144G   320G 38  0  4.78M  0
c1t1d0  -  -  6  0   779K  0
c1t2d0  -  - 17  0  2.17M  0
c2t1d0  -  - 14  0  1.85M  0
  mirror 146G   318G 39  0  4.89M  0
c1t3d0  -  - 20  0  2.50M  0
c2t2d0  -  - 13  0  1.63M  0
c2t0d0  -  -  6  0   779K  0
  mirror 146G   318G 34  0  4.35M  0
c2t3d0  -  - 19  0  2.39M  0
c1t5d0  -  -  7  0  1002K  0
c1t4d0  -  -  7  0  1002K  0
  mirror 148G   316G 23  0  2.93M  0
c2t4d0  -  -  8  0  1.09M  0
c2t5d0  -  -  6  0   890K  0
c1t6d0  -  -  7  0  1002K  0
  mirror 148G   316G 35  0  4.35M  0
c1t7d0  -  -  6  0   779K  0
c2t6d0  -  - 12  0  1.52M  0
c2t7d0  -  - 17  0  2.07M  0
  c3d1p0  12K   504M  0 85  0  1.01M
--  -  -  -  -  -  -
Especially when compared to the zfs send stats on my backup server which 
managed 30MB/s via mbuffer (Being received on a single virtual SATA disk):
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rpool   5.12G  42.6G  0  5  0  27.1K
  c4t0d0s0  5.12G  42.6G  0  5  0  27.1K
--  -  -  -  -  -  -
zfspool  431G  4.11T261  0  31.4M  0
  raidz2 431G  4.11T261  0  31.4M  0
c4t1d0  -  -155  0  6.28M  0
c4t2d0  -  -155  0  6.27M  0
c4t3d0  -  -155  0  6.27M  0
c4t4d0  -  -155  0  6.27M  0
c4t5d0  -  -155  0  6.27M  0
--  -  -  -  -  -  -
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross
 
 
_
Win New York holidays with Kellogg’s  Live Search
http://clk.atdmt.com/UKM/go/111354033/direct/01/___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Ross Smith


Oh dear god.  Sorry folks, it looks like the new hotmail really doesn't play 
well with the list.  Trying again in plain text:
 
 
 Try to separate the two things:
 
 (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 That should give you wirespeed
 
I tried that already.  It still gets just 10-11MB/s from this server.
I can get zfs send / receive and mbuffer working at 30MB/s though from a couple 
of test servers (with much lower specs).
 
 (2) Try zfs send | mbuffer > /dev/null
 That should give you an idea how fast zfs send really is locally.
 
Hmm, that's better than 10MB/s, but the average is still only around 20MB/s:
summary:  942 MByte in 47.4 sec - average of 19.9 MB/s
 
I think that points to another problem though as the send mbuffer is 100% full. 
 Certainly the pool itself doesn't appear under any strain at all while this is 
going on:
 
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rc-pool  732G  1.55T171 85  21.3M  1.01M
  mirror 144G   320G 38  0  4.78M  0
c1t1d0  -  -  6  0   779K  0
c1t2d0  -  - 17  0  2.17M  0
c2t1d0  -  - 14  0  1.85M  0
  mirror 146G   318G 39  0  4.89M  0
c1t3d0  -  - 20  0  2.50M  0
c2t2d0  -  - 13  0  1.63M  0
c2t0d0  -  -  6  0   779K  0
  mirror 146G   318G 34  0  4.35M  0
c2t3d0  -  - 19  0  2.39M  0
c1t5d0  -  -  7  0  1002K  0
c1t4d0  -  -  7  0  1002K  0
  mirror 148G   316G 23  0  2.93M  0
c2t4d0  -  -  8  0  1.09M  0
c2t5d0  -  -  6  0   890K  0
c1t6d0  -  -  7  0  1002K  0
  mirror 148G   316G 35  0  4.35M  0
c1t7d0  -  -  6  0   779K  0
c2t6d0  -  - 12  0  1.52M  0
c2t7d0  -  - 17  0  2.07M  0
  c3d1p0  12K   504M  0 85  0  1.01M
--  -  -  -  -  -  -
 
Especially when compared to the zfs send stats on my backup server which 
managed 30MB/s via mbuffer (Being received on a single virtual SATA disk):
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
rpool   5.12G  42.6G  0  5  0  27.1K
  c4t0d0s0  5.12G  42.6G  0  5  0  27.1K
--  -  -  -  -  -  -
zfspool  431G  4.11T261  0  31.4M  0
  raidz2 431G  4.11T261  0  31.4M  0
c4t1d0  -  -155  0  6.28M  0
c4t2d0  -  -155  0  6.27M  0
c4t3d0  -  -155  0  6.27M  0
c4t4d0  -  -155  0  6.27M  0
c4t5d0  -  -155  0  6.27M  0
--  -  -  -  -  -  -
The really ironic thing is that the 30MB/s send / receive was sending to a 
virtual SATA disk which is stored (via sync NFS) on the server I'm having 
problems with...
 
Ross

 

 Date: Thu, 16 Oct 2008 14:27:49 +0200
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 CC: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] Improving zfs send performance
 
 Hi Ross
 
 Ross wrote:
 Now though I don't think it's network at all. The end result from that 
 thread is that we can't see any errors in the network setup, and using 
 nicstat and NFS I can show that the server is capable of 50-60MB/s over the 
 gigabit link. Nicstat also shows clearly that both zfs send / receive and 
 mbuffer are only sending 1/5 of that amount of data over the network.
 
 I've completely run out of ideas of my own (but I do half expect there's a 
 simple explanation I haven't thought of). Can anybody think of a reason why 
 both zfs send / receive and mbuffer would be so slow?
 
 Try to separate the two things:
 
 (1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null
 
 That should give you wirespeed
 
 (2) Try zfs send | mbuffer > /dev/null
 
 That should give you an idea how fast zfs send really is locally.
 
 Carsten
_
Get all your favourite content with the slick new MSN Toolbar - FREE
http://clk.atdmt.com/UKM/go/111354027/direct/01/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-16 Thread Tomas Ögren
On 16 October, 2008 - Darren J Moffat sent me these 1,7K bytes:

 Tomas Ögren wrote:
  On 15 October, 2008 - Richard Elling sent me these 4,3K bytes:
  
  Tomas Ögren wrote:
  Hello.
 
  Executive summary: I want arc_data_limit (like arc_meta_limit, but for
  data) and set it to 0.5G or so. Is there any way to simulate it?

  We describe how to limit the size of the ARC cache in the Evil Tuning 
  Guide.
  http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
  
  Will that limit the _data_ portion only, or the metadata as well?
 
 Recent builds of OpenSolaris have the ability to control on a per 
 dataset basis what is put into the ARC and L2ARC using the
 primarycache and secondarycache dataset properties:
 
   primarycache=all | none | metadata
 
   Controls what is cached in the primary cache  (ARC).  If
   this  property  is set to all, then both user data and
   metadata is cached. If this property is set  to  none,
   then  neither  user data nor metadata is cached. If this
   property is set to metadata,  then  only  metadata  is
   cached. The default value is all.
 
   secondarycache=all | none | metadata
 
   Controls what is cached in the secondary cache  (L2ARC).
   If  this  property  is set to all, then both user data
   and metadata is cached.  If  this  property  is  set  to
   none,  then  neither user data nor metadata is cached.
   If this property is set to metadata, then  only  meta-
   data is cached. The default value is all.

Yeah, the problem is (like I wrote in the first post), if I set
primarycache=metadata, then ZFS prefetch will go into horribly
inefficient mode where it will do lots of prefetching, but the
prefetched data will be discarded immediately.

128k prefetch for a 32k read will throw away the other 96k immediately.
Followed by another 128k prefetch for the next 32k read, throwing away
the other 96k.

So ZFS needs to have _some_ data cache, but I want to limit it for
short term data only.. Setting data cache limit to 512M or something
should work fine, but I want to leave the rest to metadata as that's the
place where it can help the most.

Unless I can do some trickery with a ram disk and put that as
secondarycache with data cache as well..

/Tomas
-- 
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs cp hangs when the mirrors are removed ..

2008-10-16 Thread Richard Elling
Karthik Krishnamoorthy wrote:
 We did try with this

 zpool set failmode=continue pool option

 and the wait option before running the cp command and pulling 
 out the mirrors and in both cases there was a hang and I have a core 
 dump of the hang as well.
   

You have to wait for the I/O drivers to declare that the device is
dead.  This can be up to several minutes, depending on the driver.

 Any pointers to the bug opening process ?
   

http://bugs.opensolaris.org, or bugster if you have an account.
Be sure to indicate which drivers you are using, as this is not likely
a ZFS bug, per se.  Output from prtconf -D should be a minimum.
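
For example, something like this (driver names will vary):

  prtconf -D | grep -i scsi
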
 -- richard

 Thanks
 Karthik

 On 10/15/08 22:27, Neil Perrin wrote:
   
 On 10/15/08 23:12, Karthik Krishnamoorthy wrote:
 
 Neil,

 Thanks for the quick suggestion, the hang seems to happen even with 
 the zpool set failmode=continue pool option.

 Any other way to recover from the hang ?
   
 You should set the property before you remove the devices.
 This should prevent the hang. It isn't used to recover from it.

 If you did do that then it seems like a bug somewhere in ZFS or the IO 
 stack
 below it. In which case you should file a bug.

 Neil.
 
 thanks and regards,
 Karthik

 On 10/15/08 22:03, Neil Perrin wrote:
   
 Karthik,

 The pool failmode property as implemented governs the behaviour when 
 all
 the devices needed are unavailable. The default behaviour is to wait
 (block) until the IO can continue - perhaps by re-enabling the 
 device(s).
 The behaviour you expected can be achieved by zpool set 
 failmode=continue pool,
 as shown in the link you indicated below.

 Neil.

 On 10/15/08 22:38, Karthik Krishnamoorthy wrote:
 
 Hello All,

   Summary:
   
   cp command for mirrored zfs hung when all the disks in the mirrored
   pool were unavailable.
 Detailed description:
   ~
 The cp command (copy a 1GB file from nfs to zfs) hung when all 
 the disks
   in the mirrored pool (both c1t0d9 and c2t0d9) were removed 
 physically.
NAMESTATE READ WRITE CKSUM
  testONLINE  0 0 0
mirrorONLINE  0 0 0
  c1t0d9  ONLINE  0 0 0
  c2t0d9  ONLINE  0 0 0
 We think if all the disks in the pool are unavailable, the cp 
 command should
   fail with an error (not cause a hang).
 Our request:
   
   Please investigate the root cause of this issue.
  
   How to reproduce:
   ~
   1. create a zfs mirrored pool
   2. execute cp command from somewhere to the zfs mirrored pool.
   3. remove both of the disks physically while the cp command is working
 => hang happens (the cp command never returns and we can't kill the cp 
 command)

 One engineer pointed me to this page  
 http://opensolaris.org/os/community/arc/caselog/2007/567/onepager/ 
 and indicated that if all the mirrors are removed, zfs enters a 
 hang-like state to prevent the kernel from going into a panic mode, and 
 this type of feature would be an RFE.

 My questions are

 Is there any documentation of the mirror configuration of zfs 
 that explains what happens when the underlying
 drivers detect problems in one of the mirror devices?

 It seems that the traditional views of mirror or raid-1 would 
 expect that the
 mirror would be able to proceed without interruption, and that does 
 not seem to be the case in ZFS.
 What is the purpose of the mirror, in zfs?  Is it more like an instant
 backup?  If so, what can the user do to recover, when there is an
 IO error on one of the devices?


 Appreciate any pointers and help,

 Thanks and regards,
 Karthik
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Strange result when syncing between SPARC and x86

2008-10-16 Thread Mike Futerko
Hello


Today I've suddenly noticed that symlinks (at least) are corrupted when
syncing ZFS from SPARC to x86 (zfs send | ssh | zfs recv).

Example is:

[EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
lrwxrwxrwx   1 root root  15 Oct 13 14:35
/data/zones/testfs/root/etc/services -> ./inet/services

[EMAIL PROTECTED] ls -la /data/zones/testfs/root/etc/services
lrwxrwxrwx   1 root root  15 Oct 13 14:35
/data/zones/testfs/root/etc/services -> s/teni/.ervices


Firstly I thought it was because the original FS on SPARC is compressed... so
I just synced it locally on the same machine and all was OK, just a
different FS size since the destination was not compressed.

Then I synced that copy again to x86 but the result was the same - symlinks
were corrupted... so it's not compression.

SPARC is snv_85 and x86 is snv_82; I haven't had a chance yet to test on the
latest OpenSolaris.


Any suggestions?


Thanks
Mike

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-16 Thread Ross
I might be misunderstanding here, but I don't see how you're going to improve 
on zfs set primarycache=metadata.

You complain that ZFS throws away 96kb of data if you're only reading 32kb at a 
time, but then also complain that you are IO/s bound and that this is 
restricting your maximum transfer rate.  If it's io/s that is limiting you it 
makes no difference that ZFS is throwing away 96kb of data, you're going to get 
the same iops and same throughput at your application whether you're using 32k 
or 128k zfs record sizes.

Also, you're asking on one hand for each disk to get larger IO blocks, and on 
the other you are complaining that with large block sizes a lot of data is 
wasted.  That looks like a contradictory argument to me, as you can't have 
both of these.  You just need to pick whichever one is more suited 
to your needs.

Like I said, I may be misunderstanding, but I think you might be looking for 
something that you don't actually need.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS pool not imported on boot on Solaris Xen PV DomU

2008-10-16 Thread Francois Goudal
Hi,
I am trying a setup with a Linux Xen Dom0 on which runs an OpenSolaris 2008.05 
DomU.
I have 8 hard disk partitions that I exported to the DomU (they are visible as 
c4d[1-8]p0)
I have created a raidz2 pool on these virtual disks.
Now, if I shut down the system and start it again, the pool is not 
automatically imported during the boot.
If I type zpool status, I can't see it, so I do a zpool import and I see 
that my pool is available to import, so I import it and it works.
But I wonder why it isn't imported automatically. How is the pool import 
managed during bootup? Does Solaris try to import every single pool that's 
available, or does it read some list from a file somewhere (possibly the 
boot_archive)?
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning for a file server, disabling data cache (almost)

2008-10-16 Thread Richard Elling
Tomas Ögren wrote:
 On 16 October, 2008 - Darren J Moffat sent me these 1,7K bytes:

   
 Tomas Ögren wrote:
 
 On 15 October, 2008 - Richard Elling sent me these 4,3K bytes:

   
 Tomas Ögren wrote:
 
 Hello.

 Executive summary: I want arc_data_limit (like arc_meta_limit, but for
 data) and set it to 0.5G or so. Is there any way to simulate it?
   
   
 We describe how to limit the size of the ARC cache in the Evil Tuning 
 Guide.
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
 
 Will that limit the _data_ portion only, or the metadata as well?
   
 Recent builds of OpenSolaris have the ability to control on a per 
 dataset basis what is put into the ARC and L2ARC using the
  primarycache and secondarycache dataset properties:

   primarycache=all | none | metadata

   Controls what is cached in the primary cache  (ARC).  If
   this  property  is set to all, then both user data and
   metadata is cached. If this property is set  to  none,
   then  neither  user data nor metadata is cached. If this
   property is set to metadata,  then  only  metadata  is
   cached. The default value is all.

   secondarycache=all | none | metadata

   Controls what is cached in the secondary cache  (L2ARC).
   If  this  property  is set to all, then both user data
   and metadata is cached.  If  this  property  is  set  to
   none,  then  neither user data nor metadata is cached.
   If this property is set to metadata, then  only  meta-
   data is cached. The default value is all.
 

 Yeah, the problem is (like I wrote in the first post), if I set
 primarycache=metadata, then ZFS prefetch will go into horribly
 inefficient mode where it will do lots of prefetching, but the
 prefetched data will be discarded immediately.

 128k prefetch for a 32k read will throw away the other 96k immediately.
 Followed by another 128k prefetch for the next 32k read, throwing away
 the other 96k.
   

Are you sure this is a prefetch, or is it just the recordsize?
The checksum is based on the record, so to validate the checksum
the entire record must be read.  If you have a fixed record record
sized workload where the size  128 kBytes, then you might
adjust the recordsize parameter.
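
For example, for a fixed 32 kByte workload (dataset name hypothetical):

  zfs set recordsize=32k tank/fs

Only files written after the change get the new record size.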
 -- richard

 So ZFS needs to have _some_ data cache, but I want to limit it for
 short term data only.. Setting data cache limit to 512M or something
 should work fine, but I want to leave the rest to metadata as that's the
 place where it can help the most.

 Unless I can do some trickery with a ram disk and put that as
 secondarycache with data cache as well..

 /Tomas
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Carsten Aulbert
Hi Scott,

Scott Williamson wrote:
 You seem to be using dd for write testing. In my testing I noted that
 there was a large difference in write speed between using dd to write
 from /dev/zero and using other files. Writing from /dev/zero always
 seemed to be fast, reaching the maximum of ~200MB/s and using cp which
 would perform poorer the fewer the vdevs.

You are right, the write benchmarks were done with dd just to have some
bulk bulk figures since usually zeros can be generated fast enough.

 
 This also impacted the zfs send speed, as with fewer vdevs in RaidZ2 the
 disks seemed to spend most of their time seeking during the send.
 

That seems a bit too simplistic to me. If you compare raidz with raidz2
it seems that raidz2 is not too bad with fewer vdevs. I wish there was a
way for zfs send to avoid so many seeks. The > 1 TB file system is
still being zfs sent, now close to 48 hours.

Cheers

Carsten

PS: We still have a spare thumper sitting around, maybe I give it a try
with 5 vdevs
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Carsten Aulbert
Hi Ross

Ross wrote:
 Now though I don't think it's network at all.  The end result from that 
 thread is that we can't see any errors in the network setup, and using 
 nicstat and NFS I can show that the server is capable of 50-60MB/s over the 
 gigabit link.  Nicstat also shows clearly that both zfs send / receive and 
 mbuffer are only sending 1/5 of that amount of data over the network.
 
 I've completely run out of ideas of my own (but I do half expect there's a 
 simple explanation I haven't thought of).  Can anybody think of a reason why 
 both zfs send / receive and mbuffer would be so slow?

Try to separate the two things:

(1) Try /dev/zero -> mbuffer ---> network ---> mbuffer > /dev/null

That should give you wirespeed

(2) Try zfs send | mbuffer > /dev/null

That should give you an idea how fast zfs send really is locally.
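
Concretely, a sketch of both tests plus the full pipeline - host name, port
and buffer size are placeholders:

  # (1) network only
  receiver$ mbuffer -I 9090 -m 512M > /dev/null
  sender$   mbuffer -i /dev/zero -O receiver:9090 -m 512M

  # (2) zfs send only, locally
  zfs send tank/fs@snap | mbuffer -m 512M > /dev/null

  # full transfer once both look sane
  receiver$ mbuffer -I 9090 -m 512M | zfs recv -F tank/fs
  sender$   zfs send tank/fs@snap | mbuffer -O receiver:9090 -m 512M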

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving zfs send performance

2008-10-16 Thread Scott Williamson
Hi Carsten,

You seem to be using dd for write testing. In my testing I noted that there
was a large difference in write speed between using dd to write from
/dev/zero and using other files. Writing from /dev/zero always seemed to be
fast, reaching the maximum of ~200MB/s and using cp which would perform
poorer the fewer the vdevs.

This also impacted the zfs send speed, as with fewer vdevs in RaidZ2 the
disks seemed to spend most of their time seeking during the send.

On Thu, Oct 16, 2008 at 1:27 AM, Carsten Aulbert [EMAIL PROTECTED]
 wrote:

 Some time ago I made some tests to find this:

 (1) create a new zpool
 (2) Copy user's home to it (always the same ~ 25 GB IIRC)
 (3) zfs send to /dev/null
 (4) evaluate  continue loop

 I did this for fully mirrored setups, raidz as well as raidz2, the
 results were mixed:


 https://n0.aei.uni-hannover.de/cgi-bin/twiki/view/ATLAS/ZFSBenchmarkTest#ZFS_send_performance_relevant_fo

 The culprit here might be that in retrospect this seemed like a good
 home filesystem, i.e. one which was quite fast.

 If you don't want to bother with the table:

 Mirrored setup never exceeded 58 MB/s and was getting faster the more
 small mirrors you used.

 RaidZ had its sweetspot with a configuration of '6 6 6 6 6 6 5 5', i.e.
 6 or 5 disks per RaidZ and 8 vdevs

 RaidZ2 finally was best at '10 9 9 9 9', i.e. 5 vdevs but not much worse
 with only 3, i.e. what we are currently using to get more storage space
 (gains us about 2 TB/box).

 Cheers

 Carsten
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS pool not imported on boot on Solaris Xen PV DomU

2008-10-16 Thread Richard Elling
Francois Goudal wrote:
 Hi,
 I am trying a setup with a Linux Xen Dom0 on which runs an OpenSolaris 
 2008.05 DomU.
 I have 8 hard disk partitions that I exported to the DomU (they are visible 
 as c4d[1-8]p0)
 I have created a raidz2 pool on these virtual disks.
 Now, if I shutdown the system and I start it again, the pool is not 
 automatically imported during the boot.
 If I type zpool status, I can't see it, so I do a zfs import and I see that 
 there is my pool that I can import, so I import it and it works.
 But I wonder why it isn't imported automatically. How is managed the pool 
 import during bootup ? Does solaris try to import every single pool that's 
 available, or does it read some list from a file somewhere (possibly the 
 boot_archive) ?
   

The file is /etc/zfs/zpool.cache
Unfortunately, it is not human readable, but zdb -C can be used to
examine its contents.
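
A one-time manual import from the running system should add the pool to
that cache so later boots pick it up (pool name hypothetical):

  zpool import tank     # records the pool in /etc/zfs/zpool.cache
  zdb -C                # verify the pool's config now appears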
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-16 Thread Ryan Arneson
Tano wrote:
 I'm not sure if this is a problem with the iscsitarget or zfs. I'd greatly 
 appreciate it if it gets moved to the proper list.

 Well I'm just about out of ideas on what might be wrong..

 Quick history:

I installed OS 2008.05 when it was SNV_86 to try out ZFS with VMware. Found 
out that multiple LUNs were being treated as multipaths, so I waited till SNV_94 
came out to fix the issues with VMware and iscsitadm/zfs shareiscsi=on.

I installed OS 2008.05 on a virtual machine as a test bed, ran pkg image-update to 
 SNV_94 a month ago, made some thin provisioned partitions, shared them with 
 iscsitadm and mounted on VMWare without any problems. Ran storage VMotion and 
 all went well.

 So with this success I purchased a Dell 1900 with a PERC 5/i controller 6 x 
 15K SAS DRIVEs with ZFS RAIDZ1 configuration. I shared the zfs partitions and 
 mounted them on VMWare. Everything is great till I have to write to the disks.

 It won't write!
   

What's the error exactly?
What step are you performing to get the error? Creating the vmfs3 
filesystem? Accessing the mountpoint?


 Steps I took creating the disks

 1) Installed mega_sas drivers.
 2) zpool create tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
 3) zfs create -V 1TB tank/disk1
 4) zfs create -V 1TB tank/disk2
 5) iscsitadm create target -b /dev/zvol/rdsk/tank/disk1 LABEL1
 6) iscsitadm create target -b /dev/zvol/rdsk/tank/disk2 LABEL2

Now both drives are LUN 0 but with unique VMHBA device identifiers, so they 
are detected as separate drives.

I then redid (deleted) steps 5 and 6 and changed them to

 5) iscsitadm create target -u 0 -b /dev/zvol/rdsk/tank/disk1 LABEL1
6) iscsitadm create target -u 1 -b /dev/zvol/rdsk/tank/disk2 LABEL2

VMware discovers the separate LUNs on the device identifier, but is still unable 
to write to the iSCSI LUNs.

Why is it that the steps I've conducted in SNV_94 work, but in SNV_97, 98, or 
99 they don't?

Any ideas? Any log files I can check? I am still an ignorant Linux user, so I 
 only know to look in /var/log :)
   
The relevant errors from /var/log/vmkernel on the ESX server would be 
helpful.

Also, iscsitadm list target -v

Also, I blogged a bit on OpenSolaris iSCSI & VMware ESX - I was using 
b98 on an X4500.

http://blogs.sun.com/rarneson/entry/zfs_clones_iscsi_and_vmware

 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   


-- 
Ryan Arneson
Sun Microsystems, Inc.
303-223-6264
[EMAIL PROTECTED]
http://blogs.sun.com/rarneson

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] am I screwed?

2008-10-16 Thread Johan Hartzenberg
On Mon, Oct 13, 2008 at 10:25 PM, dick hoogendijk [EMAIL PROTECTED] wrote:




 We have to dig deeper with kmdb. But before we do that, tell me please
 what is an easy way to transfer the messages from the failsafe login on
 the problematic machine to i.e. this S10u5 server. All former screen
 output had to be typed in by hand. I didn't know of another way.


If you say no to mount the pool on /a, does it still hang?

Just to ask the obvious question, did you try to press ENTER or anything
else where it was hanging?

What build are you booting into failsafe mode?  Something older, or b99?

Do you have a build-99 DVD to boot from, from which you can get a proper
running system with networking, etc?


-- 
Any sufficiently advanced technology is indistinguishable from magic.
   Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 r == Ross  [EMAIL PROTECTED] writes:

 r 1.  Loss of a server is very much the worst case scenario.
 r Disk errors are much more likely, and with raid-z2 pools on
 r the individual servers

yeah, it kind of sucks that the slow resilvering speed enforces this
two-tier scheme.

Also if you're going to have 1000 spinning platters you'll have a
drive failure every four days or so---you need to be able to do more
than one resilver at a time, and you need to do resilvers without
interrupting scrubs which could take so long to run that you run them
continuously.  The ZFS-on-zvol hack lets you do both to a point, but I
think it's an ugly workaround for lack of scalability in flat ZFS, not
the ideal way to do things.

 r A motherboard / backplane / PSU failure will offline that
 r server, but once the faulted components are replaced your pool
 r will come back online.  Once the pool is online, ZFS has the
 r ability to resilver just the changed data,

except that is not what actually happens for my iSCSI setup.  If I
'zpool offline' the target before taking it down, it usually does work
as you describe---a relatively fast resilver kicks off, and no CKSUM
errors appear later.  I've used it gently.  I haven't offlined a
raidz2 device for three weeks while writing gigabytes to the pool in
the mean time, but for my gentle use it does seem to work.

But if the iSCSI target goes down unexpectedly---ex., because I pull
the network cord---it does come back online and does resilver, but
latent CKSUM errors show up weeks later.

Also, if the head node reboots during a resilver, ZFS totally forgets
what it was doing, and upon reboot just blindly mounts the unclean
component as if it were clean, later calling all the differences CKSUM
errors.  same thing happens if you offline a device, then reboot.  The
``persistent'' offlining doesn't seem to work, and in any case the
device comes online without a proper resilver.

SVM had dirty-region logging stored in the metadb so that resilvers
could continue where they left off across reboots.  I believe SVM
usually did a full resilver when a component disappeared, but am not
sure this was always the case.  Anyway ZFS doesn't seem to have a
similar capability, at least not one that works.

so, in practice, whenever any iSCSI component goes away
unexpectedly---target server failure, power failure, kernel panic, L2
spanning tree reconfiguration, whatever---you have to scrub the whole
pool from the head node.


It's interesting how the speed and optimisation of these maintenance
activities limit pool size.  It's not just full scrubs.  If the
filesystem is subject to corruption, you need a backup.  If the
filesystem takes two months to back up / restore, then you need really
solid incremental backup/restore features, and the backup needs to be
a cold spare, not just a backup---restoring means switching the roles
of the primary and backup system, not actually moving data.  

finally, for really big pools, even O(n) might be too slow.  The ZFS
best practice guide for converting UFS to ZFS says ``start multiple
rsync's in parallel,'' but I think we're finding zpool scrubs and zfs
sends are not well-parallelized.

These reliability limitations and performance characteristics of
maintenance tasks seem to make a sort of max-pool-size Wall beyond
which you end up painted into corners.  If they were made better, I
think you'd later hit another wall at the maximum amount of data you
could push through one head node and would have to switch to some
QFS/GFS/OCFS-type separate-data-and-metadata filesystem, and to match
ZFS this filesystem would have to do scrubs, resilvers, and backups in
a distributed way not just distribute normal data access.  A month ago
I might have ranted, ``head node speed puts a cap on how _busy_ the
filesystem can be, not how big it can be, so ZFS (modulo a lot of bug
fixes) could be fantastic for data sets of virtually unlimited size
even with its single-initiator, single-head-node limitation, so long
as the pool gets very light access.''  Now, I don't think so, because
scrubbing/resilvering/backup-restore has to flow through the head
node, too.

This observation also means my preference for a ``recovery tool'' that
treats corrupt pools as read-only over fsck (online or offline) isn't
very scalable.  The original zfs kool-aid ``online maintenance'' model
of doing a cheap fsck at import time and a long O(n) fsck through
online scrubs is the only one with a future in a world where
maintenance activities can take months.


pgpzqaJe5ZecE.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 It's interesting how the speed and optimisation of these maintenance
 activities limit pool size.  It's not just full scrubs.  If the filesystem is
 subject to corruption, you need a backup.  If the filesystem takes two months
 to back up / restore, then you need really solid incremental backup/restore
 features, and the backup needs to be a cold spare, not just a
 backup---restoring means switching the roles of the primary and backup
 system, not actually moving data.   

I'll chime in here: I'm uncomfortable with such a huge ZFS pool, and
also with the ZFS-over-iSCSI-on-ZFS approach.  There
just seem to be too many moving parts depending on each other, any one of
which can make the entire pool unavailable.

For the stated usage of the original poster, I think I would aim toward
turning each of the Thumpers into an NFS server, configure the head-node
as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
to the cluster of file servers.  You'll end up with a huge logical pool,
but a Thumper outage should result only in loss of access to the data on
that particular system.  The work of scrub/resilver/replication can be
divided among the servers rather than all living on a single head node.
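
For concreteness, a hedged sketch of the client side once NFSv4.1
support lands (Linux mount syntax, hostnames hypothetical):

  mount -t nfs4 -o minorversion=1 headnode:/export/pool /mnt/pool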

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Erast Benson
pNFS is NFS-centric, of course, and it is not yet stable, is it?  BTW,
what is the ETA for the pNFS putback?

On Thu, 2008-10-16 at 12:20 -0700, Marion Hakanson wrote:
 [EMAIL PROTECTED] said:
  It's interesting how the speed and optimisation of these maintenance
  activities limit pool size.  It's not just full scrubs.  If the filesystem 
  is
  subject to corruption, you need a backup.  If the filesystem takes two 
  months
  to back up / restore, then you need really solid incremental backup/restore
  features, and the backup needs to be a cold spare, not just a
  backup---restoring means switching the roles of the primary and backup
  system, not actually moving data.   
 
 I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
 and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach.  There
 just seem to be too many moving parts depending on each other, any one of
 which can make the entire pool unavailable.
 
 For the stated usage of the original poster, I think I would aim toward
 turning each of the Thumpers into an NFS server, configure the head-node
 as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
 to the cluster of file servers.  You'll end up with a huge logical pool,
 but a Thumper outage should result only in loss of access to the data on
 that particular system.  The work of scrub/resilver/replication can be
 divided among the servers rather than all living on a single head node.
 
 Regards,
 
 Marion
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Nicolas Williams
On Thu, Oct 16, 2008 at 12:20:36PM -0700, Marion Hakanson wrote:
 I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
 and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach.  There
 just seem to be too many moving parts depending on each other, any one of
 which can make the entire pool unavailable.

But does it work well enough?  It may be faster than NFS if there's only
one client for each volume (unless you have fast slog devices for the
ZIL).  And it'd have better semantics too (e.g., no need for the client
and server to agree on identities/domains).

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 In general, such tasks would be better served by T5220 (or the new T5440 :-)
 and J4500s.  This would change the data paths from:
 client --net-- T5220 --net-- X4500 --SATA-- disks to
 client --net-- T5440 --SAS-- disks
 
 With the J4500 you get the same storage density as the X4500, but with SAS
 access (some would call this direct access).  You will have much better
 bandwidth and lower latency between the T5440 (server) and disks while still
 having the ability to multi-head the disks.  The 

There's an odd economic factor here, if you're in the .edu sector:  The
Sun Education Essentials promotional price list has the X4540 priced
lower than a bare J4500 (not on the promotional list, but with a standard
EDU discount).

We have a project under development right now which might be served well
by one of these EDU X4540's with a J4400 attached to it.  The spec sheets
for J4400 and J4500 say you can chain together enough of them to make a
pool of 192 drives.  I'm unsure about the bandwidth of these daisy-chained
SAS interconnects, though.  Any thoughts as to how high one might scale
an X4540-plus-J4x00 solution?  How does the X4540's internal disk bandwidth
compare to that of the (non-RAID) SAS HBA?

Regards,

Marion



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 nw == Nicolas Williams [EMAIL PROTECTED] writes:

nw But does it work well enough?  It may be faster than NFS if

You're talking about different things.  Gray is using NFS, period,
between the storage cluster and the compute cluster, with no iSCSI.

Gray's (``does it work well enough''):   iSCSI within storage cluster,
                                         NFS to storage consumers

Marion's (less ``uncomfortable''):       nothing(?) within storage cluster,
                                         pNFS to storage consumers

but Marion's is not really possible at all, and won't be for a while
with other groups' choice of storage-consumer platform, so it'd have
to be GlusterFS or some other goofy fringe FUSEy thing, or a
not-very-general crude in-house hack.

I guess since Gray is copying data in and out all the time he doesn't
have to worry about the glacial-restore problem and corruption
problem.  If it were my worry, I'd definitely include NFS clients in
the performance test because iSCSI is high-latency, and the NFS
clients could be more latency-sensitive than the local benchmark.  I
might test coalescing in the big data separately from running the
crunching, because maybe the big data can be copied in with
pax-over-netcat, or something other than NFS, and maybe the crunching
could treat the big data as read-only and write its small result to a
fast standalone ZFS server, which would make NFS faster.  And I'd get
the small important data that needs backup off this mess (but please
let us know how the failure-simulation testing goes!).


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-16 Thread Tano
Thank you, Ryan, for your response. I have included all the information you
requested inline in this document:

I will be testing SNV_86 again to see whether the problem persists; maybe it's
my hardware. I will confirm that soon enough.

On Thu, October 16, 2008 10:31 am, Ryan Arneson wrote:
 Tano wrote:
 I'm not sure if this is a problem with the iscsitarget or zfs. I'd greatly
 appreciate it if it gets moved to the proper list.

 Well I'm just about out of ideas on what might be wrong..

 Quick history:

 I installed OS 2008.05 when it was SNV_86 to try out ZFS with VMWare. Found
 out that multilun's were being treated as multipaths so waited till SNV_94
 came out to fix the issues with VMWARE and iscsitadm/zfs shareiscsi=on.

 I Installed OS2008.05 on a virtual machine as a test bed, pkg image-update
 to SNV_94 a month ago, made some thin provisioned partitions, shared them
 with iscsitadm and mounted on VMWare without any problems. Ran storage
 VMotion and all went well.

 So with this success I purchased a Dell 1900 with a PERC 5/i controller 6 x
 15K SAS DRIVEs with ZFS RAIDZ1 configuration. I shared the zfs partitions
 and mounted them on VMWare. Everything is great till I have to write to the
 disks.

 It won't write!

 
 What's the error exactly?

From the VMware Infrastructure front end, everything looks like it is in order.
I Send Targets to the iscsi IP, then rescan the HBA and it detects all the 
LUNs and Targets.

 What step are you performing to get the error? Creating the vmfs3
 filesystem? Accessing the mountpoint?

The error occurs when attempting to write large data sets to the mount point.
Formatting the drive VMFS3 works, and manually copying 5 megabytes of data to
the Target works. Running cp -a on the VM folder or a cold VM migration will
hang the Infrastructure Client, and the ESX host lags. No timeouts of any sort
occur; I waited up to an hour.

 

 Steps I took creating the disks

 1) Installed mega_sas drivers.
 2) zpool create tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
 3) zfs create -V 1TB tank/disk1
 4) zfs create -V 1TB tank/disk2
 5) iscsitadm create target -b /dev/zvol/rdsk/tank/disk1 LABEL1
 6) iscsitadm create target -b /dev/zvol/rdsk/tank/disk2 LABEL2

 Now both drives are LUN 0 but with unique VMHBA device identifiers, so they
 are detected as separate drives.

 I then redid (deleted) steps 5 and 6 and changed them to:

 5) iscsitadm create target -u 0 -b /dev/zvol/rdsk/tank/disk1 LABEL1
 6) iscsitadm create target -u 1 -b /dev/zvol/rdsk/tank/disk2 LABEL1

 VMware discovers the separate LUNs on the device identifier, but I am still
 unable to write to the iSCSI LUNs.

 Why is it that the steps I've conducted in SNV_94 work, but in SNV_97, 98, or
 99 they don't?

 Any ideas?? any log files I can check? I am still an ignorant linux user so
 I only know to look in /var/log :)

 The relevant errors from /var/log/vmkernel on the ESX server would be
 helpful.
 

So I weeded out the logs from /var/log/vmkernel as best I could.
Basically, every time I initiated a command from VMware I captured the logs.  I
have broken down what I was doing at each point in the logs.


Again the complete breakdown of both systems: 

[b]VMware ESX 3.5 Update 2[/b]
[EMAIL PROTECTED] log]# uname -a
Linux vmware-860-1.ucr.edu 2.4.21-57.ELvmnix #1 Tue Aug 12 17:28:03 PDT 2008 
i686 i686 i386 GNU/Linux

[EMAIL PROTECTED] log]# arch
i686

[b]Opensolaris:[/b] 
Dell PowerEdge 1900, PERC 5/i, 6 x 450GB 15K RPM SAS disks
Broadcom BNX driver: no conflicts. Quad-core 1600 MHz, 1066 MHz FSB, 8 GB RAM
 
[EMAIL PROTECTED]:~# uname -a
SunOS iscsi-sas 5.11 snv_99 i86pc i386 i86pc Solaris

[EMAIL PROTECTED]:~# isainfo -v
64-bit amd64 applications
ssse3 cx16 mon sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
ssse3 ahf cx16 mon sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu


[EMAIL PROTECTED]:~# zpool status -v
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c3t0d0s0  ONLINE   0 0 0
c3t1d0ONLINE   0 0 0

errors: No known data errors

  pool: vdrive
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
vdrive  ONLINE   0 0 0
  raidz1ONLINE   0 0 0
c5t0d0  ONLINE   0 0 0
c5t1d0  ONLINE   0 0 0
c5t2d0  ONLINE   0 0 0
c5t3d0  ONLINE   0 0 0
c5t4d0  ONLINE   0 0 0
c5t5d0  ONLINE   0 0 0

errors: No known data errors
[EMAIL PROTECTED]:~#

[EMAIL PROTECTED]:~# zfs create -V 750G vdrive/LUNA
[EMAIL PROTECTED]:~# zfs create -V 1250G vdrive/LUNB

[EMAIL PROTECTED]:~# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
rpool  

Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-16 Thread Tano
Also I had read your blog post previously.

I will be taking advantage of the cloning/snapshot section of your blog once I 
am successful writing to the Targets.

Thanks again!
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Nicolas Williams
On Thu, Oct 16, 2008 at 04:30:28PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams [EMAIL PROTECTED] writes:
 
 nw But does it work well enough?  It may be faster than NFS if
 
 You're talking about different things.  Gray is using NFS period
 between the storage cluster and the compute cluster, no iSCSI.

I was replying to Marion's comment about ZFS-over-ISCSI-on-ZFS, not to
Gray.

I can see why one might worry about ZFS-over-iSCSI-on-ZFS.  Two layers
of copy-on-write might interact in odd ways that kill performance.  But
if you want ZFS-over-iSCSI in the first place then ZFS-over-iSCSI-on-ZFS
sounds like the correct approach IF it can perform well enough.

ZFS-over-iSCSI could certainly perform better than NFS, but again, it
may depend on what kind of ZIL devices you have.
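
For example, a dedicated slog device can be attached on the serving
side (pool and device names hypothetical):

  zpool add tank log c6t0d0   # fast ZIL device for synchronous writes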

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 nw == Nicolas Williams [EMAIL PROTECTED] writes:
 mh == Marion Hakanson [EMAIL PROTECTED] writes:

nw I was replying to Marion's [...]
nw ZFS-over-iSCSI could certainly perform better than NFS,

better than what, ZFS-over-'mkfile'-files-on-NFS?  No one was
suggesting that.  Do you mean better than pNFS?  It sounded at first
like you meant iSCSI-over-ZFS should perform better than NFS, but no
one's suggesting that either.

 Gray:NFS over ZFS over iSCSI over ZFS over disk

 Marion: pNFS over ZFS over disk

they are both using the same amount of {,p}NFS.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Enable compression on ZFS root

2008-10-16 Thread Vincent Fox
 No, the last arguments are not options.  Unfortunately,
 the syntax doesn't provide a way to specify compression
 at the creation time.  It should, though.  Or perhaps
 compression should be the default.

Should I submit an RFE somewhere then?
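
In the meantime, the workaround seems to be setting it immediately
after creation, before any data lands (pool name hypothetical):

  zpool create tank c0t0d0
  zfs set compression=on tank   # inherited by datasets created later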
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread David Magda
On Oct 16, 2008, at 15:20, Marion Hakanson wrote:

 For the stated usage of the original poster, I think I would aim  
 toward
 turning each of the Thumpers into an NFS server, configure the head- 
 node
 as a pNFS/NFSv4.1

It's a shame that Lustre isn't available on Solaris yet either.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-16 Thread Nigel Smith
I googled some sub-strings from your ESX logs
and found these threads on the VMware forum,
which list similar error messages and suggest
some actions to try on the ESX server:

http://communities.vmware.com/message/828207

Also, see this thread:

http://communities.vmware.com/thread/131923

Are you using multiple Ethernet connections between the OpenSolaris box
and the ESX server?
Your 'iscsitadm list target -v' is showing Connections: 0,
so run that command after the  ESX server initiator has
successfully connected to the OpenSolaris iscsi target,
and post that output.
The log files seem to show the iscsi session has dropped out,
and the initiator is auto-retrying to connect to the target,
but failing. It may help to get a packet capture at this stage
to try and see why the logon is failing.
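
On the OpenSolaris side the capture could be as simple as (interface
name hypothetical):

  snoop -d bnx0 -o /tmp/iscsi.cap port 3260   # capture the iSCSI logon
  snoop -i /tmp/iscsi.cap -V                  # verbose decode afterwards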
Regards
Nigel Smith
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 but Marion's is not really possible at all, and won't be for a while with
 other groups' choice of storage-consumer platform, so it'd have to be
 GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude
 in-house hack. 

Well, of course the magnitude of fringe factor is in the eye of the beholder.
I didn't intend to make pNFS seem like a done deal.  I don't quite yet
think of OpenSolaris as a done deal either, still using Solaris-10 here
in production, but since this is an OpenSolaris mailing list I should be
more careful.

Anyway, from looking over the wiki/blog info, apparently the sticking
point with pNFS may be client-side availability -- there are only Linux and
(Open)Solaris NFSv4.1 clients just yet.  Still, pNFS claims to be backwards
compatible with NFS v3 clients:  If you point a traditional NFS client at
the pNFS metadata server, the MDS is supposed to relay the data from the
backend data servers.


[EMAIL PROTECTED] said:
 It's a shame that Lustre isn't available on Solaris yet either. 

Actually, that may not be so terribly fringey, either.  Lustre and Sun's
Scalable Storage product can make use of Thumpers:
http://www.sun.com/software/products/lustre/
http://www.sun.com/servers/cr/scalablestorage/

Apparently it's possible to have a Solaris/ZFS data-server for Lustre
backend storage:
http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU

I see they do not yet have anything other than Linux clients, so that's
a limitation.  But you can share out a Lustre filesystem over NFS, potentially
from multiple Lustre clients.  Maybe via CIFS/samba as well.

Lastly, I've considered the idea of using Shared-QFS to glue together
multiple Thumper-hosted ISCSI LUN's.  You could add shared-QFS clients
(acting as NFS/CIFS servers) if the client load needed more than one.
Then SAM-FS would be a possibility for backup/replication.

Anyway, I do feel that none of this stuff is quite there yet.  But my
experience with ZFS on Fibre Channel SAN storage (that sinking feeling
I've had when a little connectivity glitch resulted in a ZFS panic)
makes me wonder whether non-redundant ZFS on an iSCSI SAN is there yet,
either.  So far none of our lost-connection incidents resulted in pool
corruption, but we have only 4TB or so.  Restoring that much from tape
is feasible, but even if Gray's 150TB of data can be recreated, it would
take weeks to reload it.

If it's decided that the clustered-filesystem solutions aren't feasible yet,
the suggestion I've seen that I liked the best was Richard's, with a bad-boy
server SAS-connected to multiple J4500's.  But since Gray's project already
has the X4500's, I guess they'd have to find another use for them (:-).

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HELP! SNV_97, 98, 99 zfs with iscsitadm and VMWare!

2008-10-16 Thread Ryan Arneson
Nigel Smith wrote:
 I googled on some sub-strings from your ESX logs
 and found these threads on the VMware forum,
 which list similar error messages and suggest
 some actions to try on the ESX server:

 http://communities.vmware.com/message/828207

 Also, see this thread:

 http://communities.vmware.com/thread/131923

 Are you using multiple Ethernet connections between the OpenSolaris box
 and the ESX server?
   
Indeed, I think there might be some notion of 2 separate interfaces. I 
see 0.0.0.0 and the 138.xx.xx.xx networks.

Oct 16 06:38:29 vmware-860-1 vmkernel: 0:02:03:00.166 cpu1:1080)iSCSI: bus 0 
target 40 trying to establish session 0x9a684e0 to portal 0, address 0.0.0.0 
port 3260 group 1

Oct 16 06:16:30 vmware-860-1 vmkernel: 0:01:41:01.021 cpu1:1076)iSCSI: bus 0 
target 38 established session 0x9a402c0 #1 to portal 0, address 138.23.117.32 
port 3260 group 1, alias luna


Do you have an active interface on the OpenSolaris box that is configured for 
0.0.0.0 right now? By default, since you haven't configured a tpgt on the 
iscsi target, Solaris will broadcast all active interfaces in its SendTargets 
response. On the ESX side, ESX will attempt to log into all addresses in that 
SendTargets response, even though you may only put one address in the sw 
initiator config.

If that is the case, you have a few options

a) disable that bogus interface
b) fully configure it and and also create a vmkernel interface that can 
connect to it
c) configure a tpgt mask on the iscsi target (iscsitadm create tpgt) to 
only use the valid address; a sketch of this follows below
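
For (c), the sequence would be roughly this (tpgt number arbitrary,
address and target alias taken from the logs above):

  iscsitadm create tpgt 1
  iscsitadm modify tpgt -i 138.23.117.32 1   # bind only the valid interface
  iscsitadm modify target -p 1 luna          # restrict the target to tpgt 1

After that, the SendTargets response should advertise just the one
address.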

Also, I never see target 40 log into anything...is that still a valid 
target number?
You may want to delete everything in /var/lib/iscsi and reboot the host. 
The vmkbinding and vmkdiscovery files will be rebuilt and it will start 
over with target 0. Sometimes, things get a bit crufty.


-ryan

 Your 'iscsitadm list target -v' is showing Connections: 0,
 so run that command after the  ESX server initiator has
 successfully connected to the OpenSolaris iscsi target,
 and post that output.
 The log files seem to show the iscsi session has dropped out,
 and the initiator is auto retrying to connect to the target, 
 but failing. It may help to get a packet capture at this stage
 to try and see why the logon is failing.
 Regards
 Nigel Smith
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 200805 Grub problems

2008-10-16 Thread Mike Aldred
Ok, I managed to get my grub menu (and splashimage) back by following:

http://www.genunix.org/wiki/index.php/ZFS_rpool_Upgrade_and_GRUB

Initially, I just did it for the boot environment I wanted to use, but it didn't 
seem to work, so I also did it for the previous boot environment.  I'm not sure 
what it did, but it gave me the grub menu back (and splash image).  However, it 
would kernel panic (when trying to mount the zfs rpool, I'm guessing).

I eventually followed the exact procedure for all my boot environments, doing 
them in order from oldest to newest; I also didn't export rpool before I 
rebooted out of the LiveCD, and now it boots up fine.

Does anyone know what's going on? I've had this happen to me twice now on 
separate machines.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Enable compression on ZFS root

2008-10-16 Thread dick hoogendijk

Vincent Fox wrote:

 Or perhaps compression should be the default.

No way, please! Things taking even more memory should never be the default.
An installation switch would be nice, though.
Freedom of choice ;-)

-- 
Dick Hoogendijk -- PGP/GnuPG key: F86289CE
++ http://nagual.nl/ | SunOS 10u5 05/08 ++

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss