[zfs-discuss] one time passwords - apache infrastructure incident report 8/28/2009

2009-09-04 Thread russell aspinwall
Hi,

Just been reading the apache.org incident report for 8/28/2009 
( https://blogs.apache.org/infra/entry/apache_org_downtime_report )

The use of Solaris and ZFS on the European server was interesting, including the 
recovery.

However, what I found more interesting was the use of one-time passwords, which 
are supported by FreeBSD ( 
http://www.freebsd.org/doc/en/books/handbook/one-time-passwords.html ). 
Could or should this technology be incorporated into OpenSolaris?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Read about ZFS backup - Still confused

2009-09-04 Thread Thomas Burgess
Let me explain what I have and you can decide if it's what you're looking for.
I run a home NAS based on ZFS (due to hardware issues I am using FreeBSD 7.2
as my OS, but all the data is on ZFS).
This system has multiple uses.  I have about 10 users and 4 HTPCs connected
via gigabit.  I have ZFS filesystems for Video, Audio and Data.

I have no problem using it for my main iTunes library or storing downloaded
and recorded video.  Each user also has their own share to store data and
backups.

The system itself is made up of 3 raidz vdevs right now, each with 4 1TB
hard drives, so I have about 9 TB of usable space right now. Having a setup like
this sort of changes how you do things.  I have several computers, but all
the stuff I care about is on the NAS.  I am very happy with ZFS for this
purpose.  I originally used a Linux backend with mdadm and XFS, but I am very
much in love with my new system.  I love the ability to clone and snapshot
and I use it often.  It has already saved me from human error on 2 occasions.
It's also very fast.  I'm using cheap parts and have seen speeds over 250
MB/s, although I get around 30 MB/s per client on average with Samba.  For
streaming music and video it has never shuddered or skipped.  I have mostly
720p video but a large amount of 1080p as well.  It's not uncommon to have 3
HTPCs streaming at the same time and 2 people using the network for other
stuff.  I'm very happy with it.


I'm SURE you can find a method to back up and restore your data with ZFS.  Just
think of it more as a backend solution.  You'll still probably use whatever
method you're used to for transferring data, although I use a combination of
samba/nfs and even FTP.  If you're used to tar, no need to stop using it.
You might also look at rsync.
You could set up a ZFS filesystem on the NAS and set up rsync on your
client, then set up automatic snapshots on the ZFS machine.  This way you'd
have multiple methods of restoring (you could just dump back the latest
rsync or you could clone one of the older snapshots and dump THAT back).
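A minimal sketch of that workflow (the hostnames, dataset names and paths here
are made up, not anything from this thread):

  # on the client: push the data to its share on the NAS
  rsync -a --delete /home/me/ nas:/tank/backup/me/

  # on the NAS: take a dated snapshot after each run
  zfs snapshot tank/backup/me@`date +%Y-%m-%d`

Restoring is then either another rsync back from the share, or cloning one of
the older snapshots and copying from the clone.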




On Thu, Sep 3, 2009 at 4:58 PM, Cork Smith corkb...@sbcglobal.net wrote:

 Let me try rephrasing this. I would like the ability to restore so my
 system mirrors its state at the time when I backed it up, given that the old hard
 drive is now a doorstop.

 Cork
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to find poor performing disks

2009-09-04 Thread Roch

Scott Lawson writes:
  Also you may wish to look at the output of 'iostat -xnce 1' as well.
  
  You can post those to the list if you have a specific problem.
  
  You want to be looking for error counts increasing and specifically 'asvc_t'
  for the service times on the disks. A higher number for asvc_t may help to
   isolate poorly performing individual disks.
  
  

I blast the pool with dd, and look for drives that are
*always* active, while others in the same group have
completed their transaction group and get no more activity.
Within a group, drives should be getting the same amount of
data per 5 seconds (zfs_txg_synctime), and the ones that are
always active are the ones slowing you down.

If whole groups are unbalanced, that's a sign that they have
different amounts of free space, and the expectation is that
you will be gated by the speed of the group that needs to
catch up. 
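Something like the following is the rough recipe (pool, file path and interval
are placeholders, not a definitive procedure):

  # generate a sustained streaming write load on the pool
  dd if=/dev/zero of=/tank/test/bigfile bs=1024k count=100000 &

  # watch per-disk activity; a disk whose asvc_t stays high, or which is
  # still busy while its vdev peers have gone idle, is the one holding
  # you back
  iostat -xn 5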

-r

  
  Scott Meilicke wrote:
   You can try:
  
   zpool iostat pool_name -v 1
  
   This will show you IO on each vdev at one second intervals. Perhaps you 
   will see different IO behavior on any suspect drive.
  
   -Scott
 
  
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARC limits not obeyed in OSol 2009.06

2009-09-04 Thread Roch

Do you have the zfs primarycache property on this release?
If so, you could set it to 'metadata' or 'none'.

 primarycache=all | none | metadata

 Controls what is cached in the primary cache  (ARC).  If
 this  property  is set to all, then both user data and
 metadata is cached. If this property is set  to  none,
 then  neither  user data nor metadata is cached. If this
 property is set to metadata,  then  only  metadata  is
 cached. The default value is all.
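If the property is there, checking and setting it is just (the dataset name is
a placeholder):

  zfs get primarycache tank/data           # confirm the property exists on this build
  zfs set primarycache=metadata tank/data  # or =none to keep user data out of the ARC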


-r


Udo Grabowski writes:
  Hi,
  we've capped the ARC size to 512 MB via 'set zfs:zfs_arc_max = 0x20000000' in
  /etc/system, since the ARC still does not release memory when applications
  need it (this is another bug). But this hard limit is not obeyed; instead,
  when traversing all files in a large and deep directory, we see the values
  below (the ARC started with 300 MB). After a while the machine (Ultra 20 M2
  with 6GB) swaps and then, hours later, freezes completely (no reaction even
  to a quick push of the power button, no ping, no mouse; we have to hard
  reset). arc_summary clearly shows that the limits are not what they are
  supposed to be. If this is working as intended, then the intention must be
  changed. As poorly as the ARC is working now, it's absolutely necessary that
  a hard limit is indeed a hard limit for the ARC. Please fix this. Is there
  anything I can do to really limit or switch off the ARC completely? It has
  been breaking our production work often since we installed OSol (we came
  from SXDE 1.08, which worked better); we must find a way to stop this
  problem as fast as possible!
  
  arcstat:
  Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz c  
  13:22:16   95M   23M 24   10M   14   12M   64   22M   24   963M  536M  
  13:22:17    2K   256 10796   177   15   2229   965M  536M  
  13:22:18    2K   490 22   119   10   371   38   482   22   970M  536M  
  13:22:19    4K   214  4   1506643   1403   971M  536M  
  13:22:20    2K   427 19574   370   37   419   19   971M  536M  
  13:22:21    1K   208 19   103   17   105   21   202   19   971M  536M  
  
  13:23:16    1K   481 27808   401   47   478   27 1G 536M  
  13:23:17    2K   255 11   125   10   130   13   218   10 1G 536M  
  and counting...
  arc_summary:
  System Memory:
   Physical RAM:  6134 MB
   Free Memory :  1739 MB
   LotsFree:  95 MB
  
  ZFS Tunables (/etc/system):
   set zfs:zfs_arc_max = 0x20000000
  
  ARC Size:
   Current Size: 1357 MB (arcsize)
   Target Size (Adaptive):   512 MB (c)
   Min Size (Hard Limit):191 MB (zfs_arc_min)
   Max Size (Hard Limit):512 MB (zfs_arc_max)
  
  ARC Size Breakdown:
   Most Recently Used Cache Size:  93%479 MB (p)
   Most Frequently Used Cache Size: 6%32 MB (c-p)
  
  ARC Efficency:
   Cache Access Total: 97131108
   Cache Hit Ratio:  75%   7321   [Defined State for 
  buffer]
   Cache Miss Ratio: 24%   23886667   [Undefined State for 
  Buffer]
   REAL Hit Ratio:   67%   65874421   [MRU/MFU Hits Only]
  
   Data Demand   Efficiency:66%
   Data Prefetch Efficiency: 8%
  
  CACHE HITS BY CACHE LIST:
Anon:   --%Counter Rolled.
Most Recently Used: 15%11463028 (mru) [ 
  Return Customer ]
Most Frequently Used:   74%54411393 (mfu) [ 
  Frequent Customer ]
Most Recently Used Ghost:   10%7537123 (mru_ghost)[ 
  Return Customer Evicted, Now Back ]
Most Frequently Used Ghost: 19%14619417 (mfu_ghost)   [ 
  Frequent Customer Evicted, Now Back ]
  CACHE HITS BY DATA TYPE:
Demand Data: 3%2716192 
Prefetch Data:   0%3506 
Demand Metadata:86%63089419 
Prefetch Metadata:  10%7435324 
  CACHE MISSES BY DATA TYPE:
Demand Data: 5%1365132 
Prefetch Data:   0%36544 
Demand Metadata:40%9664064 
Prefetch Metadata:  53%12820927
  -- 
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Bill Moore Bill.Moore at sun.com writes:
 
 Moving on, modern high-capacity SATA drives are in the 100-120MB/s
 range.  Let's call it 125MB/s for easier math.  A 5-port port multiplier
 (PM) has 5 links to the drives, and 1 uplink.  SATA-II speed is 3Gb/s,
 which after all the framing overhead, can get you 300MB/s on a good day.
 So 3 drives can more than saturate a PM.  45 disks (9 backplanes at 5
 disks + PM each) in the box won't get you more than about 21 drives
 worth of performance, tops.  So you leave at least half the available
 drive bandwidth on the table, in the best of circumstances.  That also
 assumes that the SiI controllers can push 100% of the bandwidth coming
 into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting
 close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per
direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4),
amply sufficient to deal with 600MB/s.

However, they don't have this kind of slot; they have x2 PCI-E v1.0
slots (500MB/s per direction). Moreover, the SiI3132 defaults to a
MAX_PAYLOAD_SIZE of 128 bytes, therefore my guess is that each 2-port
SATA card is only able to provide 60% of the theoretical throughput[1],
or about 300MB/s.

Then they have 3 such cards: total throughput of 900MB/s.

Finally the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot
(not PCI-E). In practice such a bus can only provide a usable throughput
of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus.
So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
is that the max I/O throughput when reading from all the disks on
one of their storage pods is about 1000MB/s. This is poor compared to
a Thumper for example, but the most important factor for them was
GB/$, not GB/sec. And they did a terrific job at that!

 And I'd re-iterate what myself and others have observed about SiI and
 silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in
the stack, à la Google. Commodity hardware + reliable software = great
combo.

[1] 
http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Marc Bevand m.bevand at gmail.com writes:
 
 So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
 is that the max I/O throughput when reading from all the disks on
 1 of their storage pod is about 1000MB/s.

Correction: the SiI3132 are on x1 (not x2) links, so my guess as to
the aggregate throughput when reading from all the disks is:
3*150+100 = 550MB/s.
(150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards
to exploit closer to the max theoretical bandwidth of an x1 PCI-E
link, it would be:
3*250+100 = 850MB/s.

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Roch

100% random writes produce around 200 IOPS with a 4-6 second pause
around every 10 seconds. 

This indicates that the bandwidth you're able to transfer
through the protocol is about 50% greater than the bandwidth
the pool can offer to ZFS. Since this is not sustainable, what you
see here is ZFS trying to balance the two numbers.

-r

David Bond writes:
  Hi,
  
  I was directed here after posting in CIFS discuss (as I first thought that 
  it could be a CIFS problem).
  
  I posted the following in CIFS:
  
  When using iometer from Windows against the file share on OpenSolaris
  snv_101 and snv_111, I get pauses of around 5 seconds (maybe a little less)
  every 5 seconds where no data is transferred. When data is transferred it
  is at a fair speed and gets around 1000-2000 IOPS with 1 thread (depending
  on the work type). The maximum read response time is 200ms and the maximum
  write response time is 9824ms, which is very bad: an almost 10 second delay
  in being able to send data to the server.
  This has been experienced on 2 test servers; the same servers have also
  been tested with Windows Server 2008 and they haven't shown this problem
  (the share performance was slightly lower than CIFS, but it was consistent,
  and the average access time and maximums were very close).
  
  
  I just noticed that if the server hasn't hit its target ARC size, the
  pauses are for maybe .5 seconds, but as soon as it hits its ARC target,
  the IOPS drop to around 50% of what they were and then there are the
  longer pauses of around 4-5 seconds, and after every pause the
  performance slows even more. So it appears it is definitely server side.

  This is with 100% random IO with a spread of 33% write 66% read, 2KB
  blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.
  
  
  
  Also, I have just run some tests with different IO patterns and 100%
  sequential writes produce a consistent 2100 IOPS, except when it pauses
  for maybe .5 seconds every 10-15 seconds.

  100% random writes produce around 200 IOPS with a 4-6 second pause
  around every 10 seconds.

  100% sequential reads produce around 3700 IOPS with no pauses, just
  random peaks in response time (only 16ms) after about 1 minute of
  running, so nothing to complain about.

  100% random reads produce around 200 IOPS, with no pauses.

  So it appears that writes cause a problem; what is causing these very
  long write delays?

  A network capture shows that the server doesn't respond to the write
  from the client when these pauses occur.
  
  Also, when using iometer, the initial file creation doesn't have any
  pauses, so it might only happen when modifying files.

  Any help on finding a solution to this would be really appreciated.
  
  David
  -- 
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Change the volblocksize of a ZFS volume

2009-09-04 Thread Roch

stuart anderson writes:
 Question:

 Is there a way to change the volume blocksize, say via
 'zfs snapshot send/receive'?

 As I see things, this isn't possible as the target volume
 (including property values) gets overwritten by 'zfs receive'.

By default, properties are not received.  To pass properties,
you need to use the -R flag.

   I have tried that, and while it works for properties
   like compression, I have not found a way to preserve
   a non-default volblocksize across zfs send | zfs
   receive. The zvol created on the receive side is
   always defaulting to 8k. Is there a way to do this?
   
  
  I spoke too soon. More particularly, during the zfs send/recv
  process the receiving side reports 8k, but once the receive is done
  the volblocksize reports the expected value as sent with -R. 
  
  Hopefully, this is just a reporting bug during an active receive.
  
  Note, this was observed with s10u7 (x86).
  

Sounds like so.

I would be very surprised if one would be able to change the
volblocksize of a zvol through send/receive (with or without
-R). It's an immutable property of the zvol.
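A quick way to verify what the receiving side actually ends up with once the
receive completes (the pool, volume and snapshot names below are made up):

  zfs get -H -o value volblocksize tank/srcvol
  zfs send -R tank/srcvol@snap1 | zfs receive -d backup
  zfs get -H -o value volblocksize backup/srcvol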

-r


  Thanks.
  -- 
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] check a zfs rcvd file

2009-09-04 Thread dick hoogendijk

Lori Alt wrote:
The -n option does some verification.  It verifies that the record 
headers distributed throughout the stream are syntactically valid.  
Since each record header contains a length field which allows the next 
header to be found, one bad header will cause the processing of the 
stream to abort.  But it doesn't verify the content of the data 
associated with each record.


So, storing the stream in a zfs received filesystem is the better 
option. Alas, it also is the most difficult one. Storing to a file with 
zfs send -Rv is easy. The result is just a file and if you reboot the 
system all is OK. However, if I zfs receive -Fdu into a zfs filesystem 
I'm in trouble when I reboot the system. I get confusion on mountpoints! 
Let me explain:


Some time ago I backed up my rpool and my /export ; /export/home to 
/backup/snaps (with zfs receive -Fdu). All's OK because the newly 
created zfs FS's stay unmounted 'till the next reboot(!). When I 
rebooted my system (due to a kernel upgrade) the system would not boot, 
because it had mounted the zfs FS backup/snaps/export on /export and 
backup/snaps/export/home on /export/home. The system itself had those 
FS's too, of course. So, there was a mix-up. It would be nice if the 
backup FS's would not be mounted (canmount=noauto), but I cannot give 
this option when I create the zfs send | receive, can I? And giving this 
option later on is very difficult, because canmount is NOT recursive! 
And I don't want to set it manually on all those backed-up FS's.


I wonder how other people overcome this mountpoint issue.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 b122
+ All that's really worth doing is what we do for others (Lewis Carrol)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
Roch Bourbonnais Wrote:
100% random writes produce around 200 IOPS with a 4-6 second pause
around every 10 seconds. 

This indicates that the bandwidth you're able to transfer
through the protocol is about 50% greater than the bandwidth
the pool can offer to ZFS. Since this is not sustainable, what you
see here is ZFS trying to balance the two numbers.

When I have tested using 50% reads, 60% random using iometer over NFS, I can 
see the data going straight to disk due to the sync nature of NFS. But I also 
see writes coming to a standstill every 10 seconds or so, which I have 
attributed to the ZIL dumping to disk. Therefore I conclude that it is the 
process of dumping the ZIL to disk that (mostly?) blocks writes during the 
dumping. I do agree with Bob and others who suggest that making the size of the 
dump smaller will mask this behavior, and that seems like a good idea, although 
I have not yet tried and tested it myself.

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] check a zfs rcvd file

2009-09-04 Thread Lori Alt

On 09/04/09 09:41, dick hoogendijk wrote:

Lori Alt wrote:
The -n option does some verification.  It verifies that the record 
headers distributed throughout the stream are syntactically valid.  
Since each record header contains a length field which allows the 
next header to be found, one bad header will cause the processing of 
the stream to abort.  But it doesn't verify the content of the data 
associated with each record.


So, storing the stream in a zfs received filesystem is the better 
option. Alas, it also is the most difficult one. Storing to a file 
with zfs send -Rv is easy. The result is just a file and if you 
reboot the system all is OK. However, if I zfs receive -Fdu into a 
zfs filesystem I'm in trouble when I reboot the system. I get 
confusion on mountpoints! Let me explain:


Some time ago I backed up my rpool and my /export ; /export/home to 
/backup/snaps (with zfs receive -Fdu). All's OK because the newly 
created zfs FS's stay unmounted 'till the next reboot(!). When I 
rebooted my system (due to a kernel upgrade) the system would not 
boot, because it had mounted the zfs FS backup/snaps/export on 
/export and backup/snaps/export/home on /export/home. The system 
itself had those FS's too, of course. So, there was a mix-up. It would 
be nice if the backup FS's would not be mounted (canmount=noauto), but 
I cannot give this option when I create the zfs send | receive, can I? 
And giving this option later on is very difficult, because canmount 
is NOT recursive! And I don't want to set it manually on all those 
backed-up FS's.


I wonder how other people overcome this mountpoint issue.

The -u option to zfs recv (which was just added to support flash archive 
installs, but it's useful for other reasons too) suppresses all mounts 
of the received file systems.  So you can mount them yourself afterward 
in whatever order is appropriate, or do a 'zfs mount -a'.
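For example (the snapshot name is a placeholder; backup/snaps is the target
used elsewhere in this thread):

  zfs send -Rv tank@backup1 | zfs receive -Fdu backup/snaps
  zfs list -r -o name,mountpoint,canmount backup/snaps   # nothing is mounted yet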


lori



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] question about my hardware choice

2009-09-04 Thread Eugen Leitl

Hi zfs cognoscenti,

a few quick question about my hardware choice (a bit late, since the
box is up already):

A 3U supermicro chassis with 16x SATA/SAS hotplug
Supermicro X8DDAi (2x Xeon QC 1.26 GHz S1366, 24 GByte RAM, IPMI)
2x LSI SAS3081E-R
16x WD2002FYPS

Right now I'm running Solaris 10 5/09 (Oracle doesn't support
OpenSolaris, unfortunately).

I would like to run Oracle in a zone/container, and use the rest for
random storage and network servage.

My questions:

* does the hardware choice make sense? Particularly the LSI host adapters;
  should I change anything hardware-side?

* what kind of zfs layout would you recommend if I want to run Oracle in a 
container?

* should I put some SSD (e.g. Intel 80 GByte 2nd gen) into the system if I can,
  or doesn't Solaris 10 5/09 ZFS support it?

* is there any reason that speaks against running Oracle in containers?

* how many hot spares would you suggest?

Thanks.

-- 
Eugen* Leitl http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Archiving and Restoring Snapshots

2009-09-04 Thread Richard Elling

On Sep 3, 2009, at 10:32 PM, Tim Cook wrote:


On Fri, Sep 4, 2009 at 12:17 AM, Ross myxi...@googlemail.com wrote:
Hi Richard,

Actually, reading your reply has made me realise I was overlooking  
something when I talked about tar, star, etc...  How do you back up a  
ZFS volume?  That's something traditional tools can't do.  Are  
snapshots the only way to create a backup or archive of those?


Below the application, dd would do it.  But if you want incrementals,  
then either use the application's backup scheme or zfs send.

Personally I'm quite happy with snapshots - we have a ZFS system at  
work that's replicating all of its data to an offsite ZFS store  
using snapshots.  Using ZFS as a backup store is something I'm quite  
happy with; it's storing just a snapshot file that makes me  
nervous.


The correct answer is ndmp.  Whether Sun will ever add it to  
opensolaris is another subject entirely though.


Available since b78, with source integrated in b102.
http://www.opensolaris.org/os/project/ndmp/

But NDMP is just part of an overall data management architecture...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] check a zfs rcvd file

2009-09-04 Thread dick hoogendijk

Lori Alt wrote:
The -u option to zfs recv (which was just added to support flash 
archive installs, but it's useful for other reasons too) suppresses 
all mounts of the received file systems.  So you can mount them 
yourself afterward in whatever order is appropriate, or do a 'zfs 
mount -a'.
You misunderstood my problem. It is very convenient that the filesystems 
are not mounted. I only wish they could stay that way! Alas, they ARE 
mounted (even if I don't want them to be) when I *reboot* the system. And 
THAT's when things get ugly. I then have different zfs filesystems using 
the same mountpoints! The backed-up ones have the same mountpoints as 
their origin :-/  - The only way to stop it is to *export* the backup 
zpool OR to change *manually* the zfs prop canmount=noauto on all 
backed-up snapshots/filesystems.


As I understand it, I cannot pass canmount=noauto to the zfs receive 
command.

# zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 B121
+ All that's really worth doing is what we do for others (Lewis Carrol)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Neil Perrin



On 09/04/09 09:54, Scott Meilicke wrote:

Roch Bourbonnais Wrote:
100% random writes produce around 200 IOPS with a 4-6 second pause
around every 10 seconds. 

This indicates that the bandwidth you're able to transfer
through the protocol is about 50% greater than the bandwidth
the pool can offer to ZFS. Since this is not sustainable, what you
see here is ZFS trying to balance the two numbers.


When I have tested using 50% reads, 60% random using iometer over NFS,
I can see the data going straight to disk due to the sync nature of NFS.
But I also see writes coming to a standstill every 10 seconds or so,
which I have attributed to the ZIL dumping to disk. Therefore I conclude
that it is the process of dumping the ZIL to disk that (mostly?) blocks
writes during the dumping.


The ZIL does not work like that. It is not a journal.

Under a typical write load write transactions are batched and
written out in a group transaction (txg). This txg sync occurs
every 30s under light load but more frequently or continuously
under heavy load.

When writing synchronous data (e.g. NFS) the transactions get written immediately
to the intent log and are made stable. When the txg later commits, the
intent log blocks containing those committed transactions can be
freed. So as you can see there is no periodic dumping of
the ZIL to disk. What you are probably observing is the periodic txg
commit.
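One way to see this for yourself is to watch the pool while the synchronous
load is running (the pool name is a placeholder):

  zpool iostat tank 1

The steady stream of small writes is the intent log traffic; the periodic
bursts on top of it are the txg commits being written out.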

Hope that helps: Neil. 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding when (and how) ZFS will use spare disks

2009-09-04 Thread Scott Meilicke
This sounds like the same behavior as opensolaris 2009.06. I had several disks 
recently go UNAVAIL, and the spares did not take over. But as soon as I 
physically removed a disk, the spare started replacing the removed disk. It 
seems UNAVAIL is not the same as the disk not being there. I wish the spare 
*would* take over in these cases, since the pool is degraded.

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Eric Sproul
Scott Meilicke wrote:
 So what happens during the txg commit?
 
 For example, if the ZIL is a separate device, SSD for this example, does it 
 not work like:
 
 1. A sync operation commits the data to the SSD
 2. A txg commit happens, and the data from the SSD are written to the 
 spinning disk

#1 is correct.  #2 is incorrect.  The TXG commit goes from memory into the main
pool.  The SSD data is simply left there in case something bad happens before
the TXG commit succeeds.  Once it succeeds, then the SSD data can be 
overwritten.

The only time you need to read from a ZIL device is if a crash occurs and you
need those blocks to repair the pool.

Eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not 
work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning 
disk

So this is two writes, correct?

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Understanding when (and how) ZFS will use spare disks

2009-09-04 Thread Chris Siebenmann
 We have a number of shared spares configured in our ZFS pools, and
we're seeing weird issues where spares don't get used under some
circumstances.  We're running Solaris 10 U6 using pools made up of
mirrored vdevs, and what I've seen is:

* if ZFS detects enough checksum errors on an active disk, it will
  automatically pull in a spare.
* if the system reboots without some of the disks available (so that
  half of the mirrored pairs drop out, for example), spares will *not*
  get used. ZFS recognizes that the disks are not there; they are marked
  as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to
  use spares.

(This is in a SAN environment where half of all of the mirrors come
from one controller and half come from another one.)

 All of this makes me think that I don't understand how ZFS spares
really work, and under what circumstances they'll get used. Does
anyone know if there's a writeup of this somewhere?

(What I've gathered so far from reading zfs-discuss archives is that
ZFS spares are not handled automatically in the kernel code but are
instead deployed to pools by a fmd ZFS management module[*], doing more
or less 'zpool replace pool failing-dev spare' (presumably through
an internal code path, since 'zpool history' doesn't seem to show spare
deployment). Is this correct?)
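If you do need to bring a spare in by hand for an UNAVAIL disk, the manual
form of the same operation is (device names here are made up):

  zpool status tank                   # note the UNAVAIL device and the AVAIL spares
  zpool replace tank c2t5d0 c4t9d0    # attach spare c4t9d0 in place of c2t5d0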

 Also, searching turns up some old zfs-discuss messages suggesting that
not bringing in spares in response to UNAVAIL disks was a bug that's now
fixed in at least OpenSolaris. If so, does anyone know if the fix has
made it into S10 U7 (or is planned or available as a patch)?

 Thanks in advance.

- cks
[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that
it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
Doh! I knew that, but then forgot...

So, for the case of no separate device for the ZIL, the ZIL lives on the disk 
pool. In which case, the data are written to the pool twice during a sync:

1. To the ZIL (on disk) 
2. From RAM to disk during txg

If this is correct (and my history in this thread is not so good, so...), would 
that then explain some sort of pulsing write behavior for sync write operations?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Bob Friesenhahn

On Fri, 4 Sep 2009, Scott Meilicke wrote:


So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not 
work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning 
disk

So this is two writes, correct?


From past descriptions, the slog is basically a list of pending write 
system calls.  The only time the slog is read is after a reboot. 
Otherwise, the slog is simply updated as write operations proceed.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
So, I just re-read the thread, and you can forget my last post. I had thought 
the argument was that the data were not being written to disk twice (assuming 
no separate device for the ZIL), but the point was that the data are not read 
from the ZIL and then written to disk, but rather written from memory to disk. 
I need more coffee...
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Tim Cook
On Fri, Sep 4, 2009 at 5:36 AM, Marc Bevand m.bev...@gmail.com wrote:

 Marc Bevand m.bevand at gmail.com writes:
 
  So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
  is that the max I/O throughput when reading from all the disks on
  1 of their storage pod is about 1000MB/s.

 Correction: the SiI3132 are on x1 (not x2) links, so my guess as to
 the aggregate throughput when reading from all the disks is:
 3*150+100 = 550MB/s.
 (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)

 And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards
 to exploit closer to the max theoretical bandwidth of an x1 PCI-E
 link, it would be:
 3*250+100 = 850MB/s.

 -mrb



What's the point of arguing about what the back-end can do anyway?  This is bulk
data storage.  Their MAX input is ~100MB/sec.  The backend can more than
satisfy that.  Who cares at that point whether it can push 500MB/s or
5000MB/s?  It's not a database processing transactions.  It only needs to be
able to push as fast as the front-end can go.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Kyle McDonald

Scott Meilicke wrote:

I am still not buying it :) I need to research this to satisfy myself.

I can understand that the writes come from memory to disk during a txg write 
for async, and that is the behavior I see in testing.

But for sync, data must be committed, and an SSD/ZIL makes that faster because 
you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data 
on the SSD must get to spinning disk.

  
But the txg (which may contain more data than just the sync data that 
was written to the ZIL) is still written from memory. Just because the 
sync data was written to the ZIL, doesn't mean it's not still in memory.


 -Kyle


To the books I go!

-Scott
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Len Zaifman
We have groups generating terabytes a day of image data  from lab instruments 
and saving them to an X4500.

We have tried lzjb:   compressratio = 1.13 in 11 hours, 1.3 TB -> 1.1 TB
              gzip-9: compressratio = 1.68 in 37 hours, 1.3 TB -> 0.75 TB

The filesystem performance was noticeably laggy (i.e. ls took > 10 seconds) while 
gzip -9 compression was used.

Do you have any idea if lossless jpeg compression is being planned for ZFS? We 
envisage that of the 1.3 TB, > 0.8 TB will be images, and if we could get better 
or equivalent compression on jpeg lossless compression with less impact on the 
filesystem than gzip -9 compression, that would be worthwhile, if it worked.
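For reference, the per-dataset settings behind a comparison like the above are
just (dataset names made up):

  zfs create -o compression=lzjb tank/images-lzjb
  zfs create -o compression=gzip-9 tank/images-gzip9
  zfs get compressratio tank/images-lzjb tank/images-gzip9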
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] PMP support in Opensolaris

2009-09-04 Thread Brandon High
On Wed, Sep 2, 2009 at 4:56 PM, David Magda dma...@ee.ryerson.ca wrote:
 Said support was committed only two to three weeks ago:

 PSARC/2009/394 SATA Framework Port Multiplier Support
 6422924 sata framework has to support port multipliers
 6691950 ahci driver needs to support SIL3726/4726 SATA port multiplier

When is this going to show up in the repo at
http://pkg.opensolaris.org/dev/ ? Is it already there?

Sorry if it's a dumb question, but I'm not sure where to look so the
release process is a bit opaque to me.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker
On Sep 4, 2009, at 2:22 PM, Scott Meilicke scott.meili...@craneaerospace.com 
 wrote:


So, I just re-read the thread, and you can forget my last post. I  
had thought the argument was that the data were not being written to  
disk twice (assuming no separate device for the ZIL), but the point  
was that the data are not read from the ZIL and then written to  
disk, but rather written from memory to disk. I need more coffee...


I think you're confusing ARC write-back with the ZIL, and it isn't the sync  
writes that are blocking IO, it's the async writes that have been  
cached and are now being flushed.


Just tell the ARC to cache less IO for your hardware with the kernel  
config Bob mentioned way back.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
Yes, I was getting confused. Thanks to you (and everyone else) for clarifying.

Sync or async, I see the txg flushing to disk starve read IO.

Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Richard Elling


On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:

We have groups generating terabytes a day of image data  from lab  
instruments and saving them to an X4500.


Wouldn't it be easier to compress at the application, or between the
application and the archiving file system?

We have tried lzbj : compressratio = 1.13 in 11 hours , 1.3 TB -  
1.1 TB
   gzip -9 : compress ratio = 1.68 in  37 hours,  
1.3 TB - .75 TB


The filesystem performance was noticably laggy (ie ls took  10  
seconds) while gzip -9 compression was used


do you have any idea if lossless jpeg compression is being planned  
for ZFS? We can envisage of 1.3 TB,  .8 TB will be images and if we  
could get better or equivalent compression on jpeg lossless  
compression with less impact on the filesystem than gzip -9  
compression, that would be worthwhile, if it worked.


I don't know of anyone working on that specific compression scheme,
but I've put together some thoughts on the subject of adding a new
compressor to ZFS.  Perhaps others could comment?

http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker
On Sep 4, 2009, at 4:33 PM, Scott Meilicke scott.meili...@craneaerospace.com 
 wrote:


Yes, I was getting confused. Thanks to you (and everyone else) for  
clarifying.


Sync or async, I see the txg flushing to disk starve read IO.


Well try the kernel setting and see how it helps.

Honestly though if you can say it's all sync writes with certainty and  
IO is still blocking, you need a better storage sub-system, or an  
additional pool.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Louis-Frédéric Feuillette
On Fri, 2009-09-04 at 13:41 -0700, Richard Elling wrote:
 On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
 
  We have groups generating terabytes a day of image data  from lab  
  instruments and saving them to an X4500.
 
 Wouldn't it be easier to compress at the application, or between the
 application and the archiving file system?

Preamble:  I am actively doing research into image set compression,
specifically jpeg2000, so this is my point of reference.


I think it would be easier to compress at the application level. I would
suggest getting the image from the source, then using lossless jpeg2000
compression on it and saving the result to an uncompressed ZFS pool.

JPEG2000 uses arithmetic encoding to do the final compression step.
Arithmetic encoding has a higher compression rate (in general) than
gzip-9, lzjb or others.  There is an open-source implementation of
jpeg2000 called Jasper[1].  Jasper is the reference implementation for
jpeg2000, meaning that all other jpeg2000 programs must verify their
output against that of Jasper (kinda).

Saving the jpeg2000 image to an uncompressed ZFS partition will be the
fastest thing.  Since jpeg2000 is already compressed, trying to compress
it will not yield any storage space reduction; in fact it may _increase_
the size of the data stored on disk.  Since good compression algorithms
produce essentially random data, you can see why running on a compressed
pool would be bad for performance.

[1] http://www.ece.uvic.ca/~mdadams/jasper

On a side note, if you want to know how Arithmetic encoding works,
Wikipedia[2] has a really nice explanation.  Suffice it to say, in theory
(without considering implementation details) arithmetic encoding can
encode _any_ data at the rate of data_entropy*num_of_symbols +
data_symbol_table. In practice this doesn't happen due to floating point
overflows and some other issues.
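As an illustrative example (numbers invented purely for the arithmetic): an
image of 10^6 8-bit samples whose measured entropy is 4.2 bits/symbol would
bound out at roughly 4.2 x 10^6 bits, about 525 KB plus the symbol table,
versus 1 MB raw.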

[2] http://en.wikipedia.org/wiki/Arithmetic_coding

-- 
Louis-Frédéric Feuillette jeb...@gmail.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Scott Meilicke
I only see the blocking while load testing, not during regular usage, so I am 
not so worried. I will try the kernel settings to see if that helps if/when I 
see the issue in production. 

For what it is worth, here is the pattern I see when load testing NFS (iometer, 
60% random, 65% read, 8k chunks, 32 outstanding I/Os):

data01  59.6G  20.4T 46 24   757K  3.09M
data01  59.6G  20.4T 39 24   593K  3.09M
data01  59.6G  20.4T 45 25   687K  3.22M
data01  59.6G  20.4T 45 23   683K  2.97M
data01  59.6G  20.4T 33 23   492K  2.97M
data01  59.6G  20.4T 16 41   214K  1.71M
data01  59.6G  20.4T  3  2.36K  53.4K  30.4M
data01  59.6G  20.4T  1  2.23K  20.3K  29.2M
data01  59.6G  20.4T  0  2.24K  30.2K  28.9M
data01  59.6G  20.4T  0  1.93K  30.2K  25.1M
data01  59.6G  20.4T  0  2.22K  0  28.4M
data01  59.7G  20.4T     21    295   317K  4.48M
data01  59.7G  20.4T 32 12   495K  1.61M
data01  59.7G  20.4T 35 25   515K  3.22M
data01  59.7G  20.4T 36 11   522K  1.49M
data01  59.7G  20.4T 33 24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Nicolas Williams
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote:
 On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
 We have groups generating terabytes a day of image data  from lab  
 instruments and saving them to an X4500.
 
 Wouldn't it be easier to compress at the application, or between the
 application and the archiving file system?

Especially when it comes to reading the images back!

ZFS compression is transparent.  You can't write uncompressed data then
read back compressed data.  And compression is at the block level, not
for the whole file, so even if you could read it back compressed, it
wouldn't be in a useful format.

Most people want to transfer data compressed, particularly images.  So
compressing at the application level in this case seems best to me.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] check a zfs rcvd file

2009-09-04 Thread Lori Alt

On 09/04/09 10:17, dick hoogendijk wrote:

Lori Alt wrote:
The -u option to zfs recv (which was just added to support flash 
archive installs, but it's useful for other reasons too) suppresses 
all mounts of the received file systems.  So you can mount them 
yourself afterward in whatever order is appropriate, or do a 'zfs 
mount -a'.
You misunderstood my problem. It is very convenient that the 
filesystems are not mounted. I only wish they could stay that way!. 
Alas, they ARE mounted (even if I don't want them to) when I  *reboot* 
the system. And THAT's when thing get ugly. I then have different zfs 
filesystems using the same mountpoints! The backed up ones have the 
same mountpoints as their origin :-/  - The only way to stop it is to 
*export* the backup zpool OR to change *manualy* the zfs prop 
canmount=noauto in all backed up snapshots/filesystems.


As I understand I cannot give this canmount=noauto to the zfs 
receive command.

# zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps
There is an RFE to allow zfs recv to assign properties, but I'm not sure 
whether it would help in your case.  I would have thought that 
canmount=noauto would have already been set on the sending side, 
however.  In that case, the property should be preserved when the stream 
is received.  But if for some reason you're not setting that property 
on the sending side, but want it set on the receiving side, you might 
have to write a script to set the properties for all those datasets 
after they are received.
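A minimal sketch of such a script, using the backup/snaps target from this
thread (adjust the dataset name to whatever you actually received into):

  #!/bin/sh
  # set canmount=noauto on every file system received under backup/snaps
  for fs in `zfs list -H -o name -t filesystem -r backup/snaps`; do
      zfs set canmount=noauto $fs
  done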


lori

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs compression algorithm : jpeg ??

2009-09-04 Thread Bob Friesenhahn

On Fri, 4 Sep 2009, Louis-Frédéric Feuillette wrote:


JPEG2000 uses arithmetic encoding to do the final compression step.
Arithmetic encoding has a higher compression rate (in general) than
gzip-9, lzjb or others.  There is an open-source implementation of
jpeg2000 called Jasper[1].  Jasper is the reference implementation for
jpeg2000, meaning that all other jpeg2000 programs must verify their
output against that of Jasper (kinda).


Jasper is incredibly slow and consumes a large amount of memory.  Other 
JPEG2000 programs are validated by how many times faster they are than 
Jasper. :-)


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Bob Friesenhahn

On Fri, 4 Sep 2009, Scott Meilicke wrote:

I only see the blocking while load testing, not during regular 
usage, so I am not so worried. I will try the kernel settings to see 
if that helps if/when I see the issue in production.


The flipside of the pulsing is that the deferred writes diminish 
contention for precious read IOPs, and quite a few programs have a 
habit of updating/rewriting a file over and over again.  If the file 
is completely asynchronously rewritten once per second and zfs writes 
a transaction group every 30 seconds, then 29 of those updates avoid 
consuming write IOPs.  Another benefit is that if zfs has more data in 
hand to write, then it can do a much better job of avoiding 
fragmentation, avoid unnecessary COW by diminishing short tail writes, 
and achieve more optimal write patterns.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker
On Sep 4, 2009, at 6:33 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Fri, 4 Sep 2009, Scott Meilicke wrote:

I only see the blocking while load testing, not during regular  
usage, so I am not so worried. I will try the kernel settings to  
see if that helps if/when I see the issue in production.


The flipside of the pulsing is that the deferred writes diminish  
contention for precious read IOPs and quite a few programs have a  
habit of updating/rewriting a file over and over again.  If the file  
is completely asynchronously rewritten once per second and zfs  
writes a transaction group every 30 seconds, then 29 of those  
updates avoid consuming write IOPs.  Another benefit is that if  
zfs has more data in hand to write, then it can do a much better job  
of avoiding fragmentation, avoid unnecessary COW by diminishing  
short tail writes, and achieve more optimum write patterns.


I guess one can find a silver lining in any grey cloud, but for myself  
I'd just rather see a more linear approach to writes. Anyway I have  
never seen any reads happen during these write flushes.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker
On Sep 4, 2009, at 5:25 PM, Scott Meilicke scott.meili...@craneaerospace.com 
 wrote:


I only see the blocking while load testing, not during regular  
usage, so I am not so worried. I will try the kernel settings to see  
if that helps if/when I see the issue in production.


For what it is worth, here is the pattern I see when load testing  
NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):


data01  59.6G  20.4T 46 24   757K  3.09M
data01  59.6G  20.4T 39 24   593K  3.09M
data01  59.6G  20.4T 45 25   687K  3.22M
data01  59.6G  20.4T 45 23   683K  2.97M
data01  59.6G  20.4T 33 23   492K  2.97M
data01  59.6G  20.4T 16 41   214K  1.71M
data01  59.6G  20.4T  3  2.36K  53.4K  30.4M
data01  59.6G  20.4T  1  2.23K  20.3K  29.2M
data01  59.6G  20.4T  0  2.24K  30.2K  28.9M
data01  59.6G  20.4T  0  1.93K  30.2K  25.1M
data01  59.6G  20.4T  0  2.22K  0  28.4M
data01  59.7G  20.4T     21    295   317K  4.48M
data01  59.7G  20.4T 32 12   495K  1.61M
data01  59.7G  20.4T 35 25   515K  3.22M
data01  59.7G  20.4T 36 11   522K  1.49M
data01  59.7G  20.4T 33 24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.


With that setup you'll see at most 3x the IOPS of the individual disks, which  
is not really the kind of setup for a 60% random workload. Assuming 2TB SATA  
drives, the max would be around 240 IOPS.


Now if it were mirror vdevs you'd get 7x or 560 IOPS.
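(Rough arithmetic, assuming about 80 random IOPS per 7200 RPM SATA drive:
3 raidz vdevs x ~80 is roughly 240 IOPS, whereas 7 two-way mirrors built from
the same 15 bays x ~80 is roughly 560 IOPS.)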

Is this for VMware or data warehousing?

You'll also need an SSD drive in the mix if you're not using a  
controller with NVRAM write-back. Especially when sharing over NFS.


I guess since it's 15 drives it's an MD1000. I might have gone with  
the newer 2.5" drive enclosure, as it holds 24 drives instead of 15 and  
most SSDs come in 2.5".


Since you got it already, invest in a PERC 6/E with 512MB of cache and  
stick it in the other PCIe 8x slot.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] PMP support in Opensolaris

2009-09-04 Thread Brandon High
On Fri, Sep 4, 2009 at 1:12 PM, Nigel
Smithnwsm...@wilusa.freeserve.co.uk wrote:
 Let us know if you can get the port multipliers working..

 But remember, there is a problem with ZFS raidz in that release, so be 
 careful:

I saw that, so I think I'll be waiting until snv_124 to update. The
system that I'm thinking of using currently only has mirrored vdevs
however, so it shouldn't be any risk.

Something like one of the following seems reasonable to add a few
drives to an existing system, although eSATA just seems like a bad
idea for a number of reasons:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816132016
http://www.newegg.com/Product/Product.aspx?Item=N82E16816111057

A good use that I can see is combining an Intel D945GCLF2 board with a
case that has more than 2 drive bays, using an internal PMP. One of
the systems I have is an Atom board in a small Chenbro 2-bay case,
which gives surprisingly good performance. There is a 4-bay
version available, but the lack of SATA ports on the motherboard kept me
from using it.

http://www.cooldrives.com/siseata5pomu.html
http://www.newegg.com/Product/Product.aspx?Item=N82E16811123122

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Bob Friesenhahn

On Fri, 4 Sep 2009, Ross Walker wrote:


I guess one can find a silver lining in any grey cloud, but for myself I'd 
just rather see a more linear approach to writes. Anyway I have never seen 
any reads happen during these write flushes.


I have yet to see a read happen during the write flush either.  That 
impacts my application since it needs to read in order to proceed, and 
it does a similar amount of writes as it does reads.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker
On Sep 4, 2009, at 8:59 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Fri, 4 Sep 2009, Ross Walker wrote:


I guess one can find a silver lining in any grey cloud, but for  
myself I'd just rather see a more linear approach to writes. Anyway  
I have never seen any reads happen during these write flushes.


I have yet to see a read happen during the write flush either.  That  
impacts my application since it needs to read in order to proceed,  
and it does a similar amount of writes as it does reads.


The ARC makes it hard to tell if they are satisfied from cache or  
blocked due to writes.


I suppose if you have the hardware to go sync that might be the best  
bet. That and limiting the write cache.


Though I have only heard good comments from my ESX admins since moving  
the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Bob Friesenhahn

On Fri, 4 Sep 2009, Ross Walker wrote:


I have yet to see a read happen during the write flush either.  That 
impacts my application since it needs to read in order to proceed, and it 
does a similar amount of writes as it does reads.


The ARC makes it hard to tell if they are satisfied from cache or blocked due 
to writes.


The existing prefetch bug makes it doubly hard. :-)

First I complained about the blocking reads, and then I complained 
about the blocking writes (presumed responsible for the blocking 
reads) and now I am waiting for working prefetch in order to feed my 
hungry application.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread David Magda

On Sep 4, 2009, at 21:44, Ross Walker wrote:

Though I have only heard good comments from my ESX admins since  
moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that  
bad.


What's your pool configuration? Striped mirrors? RAID-Z with SSDs?  
Other?




Re: [zfs-discuss] Pulsing write performance

2009-09-04 Thread Ross Walker

On Sep 4, 2009, at 10:02 PM, David Magda dma...@ee.ryerson.ca wrote:


On Sep 4, 2009, at 21:44, Ross Walker wrote:

Though I have only heard good comments from my ESX admins since  
moving the VMs off iSCSI and on to ZFS over NFS, so it can't be  
that bad.


What's your pool configuration? Striped mirrors? RAID-Z with SSDs?  
Other?


Striped mirrors off an NVRAM-backed controller (Dell PERC 6/E).

RAID-Z isn't the best for many VMs, as the whole vdev acts as a single
disk for random I/O.
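
(For anyone wanting to reproduce that kind of layout, a minimal sketch
with made-up device names -- not our exact config:)

zpool create vmpool \
    mirror c1t0d0 c1t1d0 \
    mirror c1t2d0 c1t3d0 \
    mirror c1t4d0 c1t5d0

# each mirror adds its own spindles' worth of random IOPS, unlike a
# single raidz vdev, which behaves like one disk for random I/O
zfs create vmpool/esx
zfs set sharenfs=on vmpool/esx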


-Ross



Re: [zfs-discuss] Motherboard for home zfs/solaris file server

2009-09-04 Thread Tim Cook
On Thu, Sep 3, 2009 at 4:57 AM, Karel Gardas karel.gar...@centrum.cz wrote:

 Hello,
 the idea that ECC support was dropped from (Open)Solaris 2009.06 is a
 misunderstanding. OS 2009.06 supports ECC just as 2005 did. Just install
 it and use my updated ecccheck.pl script to get informed about errors.
 You can also verify that Solaris' memory scrubber is really running, if
 you are that curious:
 http://developmentonsolaris.wordpress.com/2009/03/06/how-to-make-sure-memory-scrubber-is-running/
 Karel
 --



Is there something that needs to be done on the Solaris side for memscrub
scans to occur?  I'm running snv_118 with a Supermicro board, ECC memory,
and AMD Opteron CPUs.  It would appear it's doing a lot of nothing.

Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0
AuthenticAMD 40F13 family 15 model 65 step 3 clock 2010 MHz)
Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: Dual-Core AMD
Opteron(tm) Processor 2212

r...@fserv:~# isainfo -v
64-bit amd64 applications
tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx
cmov
amd_sysc cx8 tsc fpu
32-bit i386 applications
tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx
cmov
amd_sysc cx8 tsc fpu




r...@fserv:~# echo memscrub_scans_done/U | mdb -k
memscrub_scans_done:
memscrub_scans_done:0
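
(A couple of related checks, as a sketch -- these assume the x86 memscrub
code uses the usual disable_memscrub and memscrub_period_sec variables,
which may differ between builds:)

echo disable_memscrub/D | mdb -k       # non-zero means the scrubber is disabled
echo memscrub_period_sec/D | mdb -k    # seconds between scrub passes

# to force it on at boot, something like this in /etc/system:
set disable_memscrub = 0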


[zfs-discuss] incremental send/recv larger than sum of snapshots?

2009-09-04 Thread Peter Baumgartner
I've been sending daily incrementals off-site for a while now, but
recently they failed so I had to send an incremental covering a number
of snapshots. I expected the incremental to be approximately the sum
of the snapshots, but it seems to be considerably larger and still
going. The source machine is nv72 and the destination is nv99. I
send/recv with this command:

/usr/sbin/zfs send -i tank/v...@2009-08-15 tank/v...@2009-08-26 | bzip2 -c
| ssh offsite-computer bzcat | /usr/sbin/zfs recv -F tank/vm

The sum of the 11 days of snapshots is about 100G, but I see the
remote computer registering over 130G. I'm pushing this over a single
T1, so the process has been running for about a week. Is this
expected? If so, is there any way I can calculate how much data will
need to be transferred?

Here is a snippet of zfs list on the source:

tank/v...@2009-08-14   8.46G  -   440G  -
tank/v...@2009-08-15   7.49G  -   440G  -
tank/v...@2009-08-16   7.42G  -   440G  -
tank/v...@2009-08-17   7.45G  -   441G  -
tank/v...@2009-08-18   11.0G  -   538G  -
tank/v...@2009-08-19   11.1G  -   479G  -
tank/v...@2009-08-20   11.1G  -   479G  -
tank/v...@2009-08-21   7.61G  -   480G  -
tank/v...@2009-08-22   6.45G  -   481G  -
tank/v...@2009-08-23   7.31G  -   481G  -
tank/v...@2009-08-24   9.66G  -   481G  -
tank/v...@2009-08-25   10.1G  -   481G  -
tank/v...@2009-08-26   12.5G  -   481G  -


And the remote:

tank/v...@2009-08-14   8.46G  -   440G  -
tank/v...@2009-08-15   9.38G  -   440G  -
tank/vm/%2009-08-26    136G   867G   475G   /tank/vm/%2009-08-26
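
(One crude way to answer the sizing question locally, assuming the dataset
really is tank/vm as in the recv target: generate the stream and count its
bytes without sending it anywhere. Newer builds also have a dry-run
estimate, though that option may not exist on nv72.)

# actual (uncompressed) stream size; this reads all the changed data, so it takes a while
zfs send -i tank/vm@2009-08-15 tank/vm@2009-08-26 | wc -c

# on builds whose zfs send supports -n/-v, a quicker estimate:
zfs send -nv -i tank/vm@2009-08-15 tank/vm@2009-08-26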


Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Marc Bevand
Tim Cook tim at cook.ms writes:
 
 Whats the point of arguing what the back-end can do anyways?  This is bulk 
data storage.  Their MAX input is ~100MB/sec.  The backend can more than 
satisfy that.  Who cares at that point whether it can push 500MB/s or 
5000MB/s?  It's not a database processing transactions.  It only needs to be 
able to push as fast as the front-end can go.  --Tim

True, what they have is sufficient to match GbE speed. But internal I/O
throughput matters for resilvering RAID arrays, scrubbing, local data
analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per
pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize
this number) and 10 drives behind PCI-E links per array, so this means the PCI
bus's ~100MB/s practical bandwidth is shared by 5 drives, or 20MB/s per
(1.5TB) drive, so it is going to take a minimum of 20.8 hours to resilver one
of their arrays.
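
(For reference, the 20.8-hour figure is just the per-drive rate applied to
the full drive capacity:)

    1.5 TB / (20 MB/s) = 1,500,000 MB / 20 MB/s = 75,000 s ~= 20.8 hours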

-mrb



Re: [zfs-discuss] Petabytes on a budget - blog

2009-09-04 Thread Tim Cook
On Sat, Sep 5, 2009 at 12:30 AM, Marc Bevand m.bev...@gmail.com wrote:

 Tim Cook tim at cook.ms writes:
 
  Whats the point of arguing what the back-end can do anyways?  This is bulk
 data storage.  Their MAX input is ~100MB/sec.  The backend can more than
 satisfy that.  Who cares at that point whether it can push 500MB/s or
 5000MB/s?  It's not a database processing transactions.  It only needs to be
 able to push as fast as the front-end can go.  --Tim

 True, what they have is sufficient to match GbE speed. But internal I/O
 throughput matters for resilvering RAID arrays, scrubbing, local data
 analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per
 pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize
 this number) and 10 drives behind PCI-E links per array, so this means the
 PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, or 20MB/s per
 (1.5TB) drive, so it is going to take a minimum of 20.8 hours to resilver one
 of their arrays.

 -mrb


But none of that matters.  The data is replicated at a higher layer,
combined with RAID-6.  They'd have to see a triple disk failure across
multiple arrays at the same time...  They aren't concerned with performance;
the home users they're backing up aren't ever going to get anything remotely
close to gigE speeds.  The absolute BEST case scenario *MIGHT* push 20mbit if
the end-user is lucky enough to have FiOS or DOCSIS 3.0 in their area and
has large files with a clean link.

Even rebuilding two failed disks, that setup will push 2MB/sec all day long.

--Tim