Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-07 Thread Sandon Van Ness
I think the risk to data integrity and of complete volume loss is highest 
in the following order:


1. 1x Raidz(7+1)
2. 2x RaidZ(3+1)
3. 1x Raidz2(6+2)

Simple raidz certainly is an option with only 8 disks (8 is about the 
maximum I would go), but to be honest I would feel safer going raidz2. 
The 2x raidz (3+1) would probably perform the best, but I would prefer 
the 3rd option (raidz2) as it is better for redundancy. With raidz2 any 
two disks can fail, and if you get some unrecoverable read errors during 
a scrub you have a much better chance of avoiding corruption, thanks to 
the double parity over the same set of data.
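
To put a rough number on that, here is a back-of-the-envelope sketch 
(assuming the 1-in-10^14 bit error rate quoted below and independent errors):

    # Expected unrecoverable read errors while rebuilding a degraded 7+1
    # raidz, which has to read all 7 surviving 2TB drives:
    echo 'scale=2; 7 * 2 * 10^12 * 8 / 10^14' | bc
    # => 1.12

In other words, at that error rate you should expect on the order of one URE 
per full raidz rebuild, which is exactly the case where raidz2's second 
parity saves you.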


On 02/06/2011 06:45 PM, Matthew Angelo wrote:

I require a new high capacity 8 disk zpool.  The disks I will be
purchasing (Samsung or Hitachi) have an Error Rate (non-recoverable,
bits read) of 1 in 10^14 and will be 2TB.  I'm staying clear of WD
because they have the new 4096-byte (Advanced Format) sectors, which don't
play nice with ZFS at the moment.

My question is, how do I determine which of the following zpool and
vdev configurations I should run to maximize space whilst mitigating
rebuild failure risk?

1. 2x RAIDZ(3+1) vdev
2. 1x RAIDZ(7+1) vdev
3. 1x RAIDZ2(6+2) vdev


I just want to prove I shouldn't run a plain old RAID5 (RAIDZ) with 8x
2TB disks.

Cheers
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:

 Hi all,

 I'm trying to achieve the same effect as UFS directio on ZFS and here
 is what I did:

 Solaris UFS directio has three functions:
        1. improved async code path
        2. multiple concurrent writers
        3. no buffering

Thanks for the comments, Richard. All I want is to achieve #3 on ZFS.
But as I said, apparently 2.a) below didn't give me that. Do you have
any suggestions?

 Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing
 to set or change to take advantage of the feature.


 1. Set the primarycache of zfs to metadata and secondarycache to none,
 recordsize to 8K (to match the unit size of writes)
 2. Run my test program (code below) with different options and measure
 the running time.
 a) open the file without O_DSYNC flag: 0.11s.
 This doesn't seem like directio is in effect, because I tried on UFS
 and time was 2s. So I went on with more experiments with the O_DSYNC
 flag set. I know that directio and O_DSYNC are two different things,
 but I thought the flag would force synchronous writes and achieve what
 directio does (and more).

 Directio and O_DSYNC are two different features.

 b) open the file with O_DSYNC flag: 147.26s

 ouch

 c) same as b) but also enabled zfs_nocacheflush: 5.87s

 Is your pool created from a single HDD?
Yes, it is. Do you have an explanation for the b) case? I also tried
O_DSYNC and directio together on UFS; the time is on the same order as
directio without O_DSYNC on UFS (see below). This dramatic difference between
UFS and ZFS is puzzling me...
UFS:  directio=on, no O_DSYNC - 2s        directio=on, O_DSYNC - 5s
ZFS:  no caching, no O_DSYNC - 0.11s      no caching, O_DSYNC - 147s


 My questions are:
 1. With my primarycache and secondarycache settings, the FS shouldn't
 buffer reads and writes anymore. Wouldn't that be equivalent to
 O_DSYNC? Why a) and b) are so different?

 No. O_DSYNC deals with when the I/O is committed to media.

 2. My understanding is that zfs_nocacheflush essentially removes the
 sync command sent to the device, which cancels the O_DSYNC flag. Why
 b) and c) are so different?

 No. Disabling the cache flush means that the volatile write buffer in the
 disk is not flushed. In other words, disabling the cache flush is in direct
 conflict with the semantics of O_DSYNC.

 3. Does ZIL have anything to do with these results?

 Yes. The ZIL is used for meeting the O_DSYNC requirements.  This has
 nothing to do with buffering. More details are on the ZFS Best Practices 
 Guide.
  -- richard


 Thanks in advance for any suggestion/insight!
 Yi


 #include <fcntl.h>
 #include <sys/time.h>
 #include <stdio.h>      /* printf */
 #include <unistd.h>     /* pwrite, close */

 int main(int argc, char **argv)
 {
   struct timeval tim;
   gettimeofday(&tim, NULL);
   double t1 = tim.tv_sec + tim.tv_usec/1000000.0;
   char a[8192];
   int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
   //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
   if (argv[2][0] == '1')
       directio(fd, DIRECTIO_ON);    /* Solaris directio(3C) hint */
   int i;
   for (i = 0; i < 10000; ++i)       /* 10000 x 8K writes = ~80MB */
       pwrite(fd, a, sizeof(a), i*8192);
   close(fd);
   gettimeofday(&tim, NULL);
   double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
   printf("%f\n", t2-t1);
 }
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [?] - What is the recommended number of disks for a consumer PC with ZFS

2011-02-07 Thread Rob Clark
References:

Thread: ZFS effective short-stroking and connection to thin provisioning? 
http://opensolaris.org/jive/thread.jspa?threadID=127608

Confused about consumer drives and zfs can someone help?
http://opensolaris.org/jive/thread.jspa?threadID=132253

Recommended RAM for ZFS on various platforms
http://opensolaris.org/jive/thread.jspa?threadID=132072

Performance advantages of zpool with 2x raidz2 vdevs vs. Single vdev - Spindles
http://opensolaris.org/jive/thread.jspa?threadID=132127
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Drive i/o anomaly

2011-02-07 Thread Matt Connolly
Hi, I have a low-power server with three drives in it, like so:


matt@vault:~$ zpool status
  pool: rpool
 state: ONLINE
 scan: resilvered 588M in 0h3m with 0 errors on Fri Jan  7 07:38:06 2011
config:

NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror-0ONLINE   0 0 0
c8t1d0s0  ONLINE   0 0 0
c8t0d0s0  ONLINE   0 0 0
cache
  c12d0s0 ONLINE   0 0 0

errors: No known data errors


I'm running netatalk file sharing for mac, and using it as a time machine 
backup server for my mac laptop.

When files are copying to the server, I often see periods of a minute or so 
where network traffic stops. I'm convinced there's some bottleneck on the 
storage side of things, because when this happens I can still ping the machine, 
and if I have an ssh window open I can still see output from a `top` command 
running smoothly. However, if I try to do anything that touches disk (eg `ls`), 
that command stalls. When it comes good, everything comes good: file 
copies across the network continue, etc.

If I have an ssh terminal session open and run `iostat -xn 5` I see something 
like this:


                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2   36.0  153.6  4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
    0.0  113.4    0.0  7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
    0.2  106.4    4.1  7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4   73.2   25.7  9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
    0.0  226.6    0.0 24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
    0.2  127.6    3.4 12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   44.2    0.0  5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
    0.2   76.0    4.8  9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
    0.0   16.6    0.0  2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0    25.6  0.0  0.0    0.3    2.3   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   11.0    0.0  1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1   25.4   0   1 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.6    0.0  2182.4  9.0  1.0  511.3   56.8 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   16.6    0.0  2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   15.8    0.0  1959.2  9.0  1.0  569.6   63.3 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   17.4    0.0  2157.6  9.0  1.0  517.2   57.4 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   18.2    0.0  2256.8  9.0  1.0  494.5   54.9 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c12d0
    0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
    0.0   14.8    0.0  1835.2  9.0  1.0  608.1   67.5 100 100 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.0    0.1     0.0  0.0  0.0    0.1    0.1   0   0 c12d0
    0.0    1.4    0.0     0.6  0.0  0.0    0.0    0.2   0   0 c8t0d0
    0.0   49.0    0.0  6049.6  6.7  0.5  137.6   11.2 100  55 c8t1d0
                    extended device statistics
    r/s    w/s   kr/s

[zfs-discuss] ZFS Newbie question

2011-02-07 Thread Gaikokujin Kyofusho
I’ve spent a few hours reading through the forums and wiki and honestly my head 
is spinning. I have been trying to study up on either buying or building a box 
that would allow me to add drives of varying sizes/speeds/brands (adding more 
later etc.) and still be able to use the full space of the drives (minus parity? 
[not sure if I got the terminology right]) with redundancy. The one “all in one” 
solution I have found is the Drobo, however it has many caveats, such as a 
proprietary setup, a limited number of drives (I am looking to eventually expand 
beyond 8 drives), and a price tag that is borderline criminal. 

From what I understand, using ZFS one could set up something like RAID 6 
(RAID-Z2?) but with the ability to use drives of varying sizes/speeds/brands 
and to add additional drives later. Am I about right? If so, I will continue 
studying up on this; if not, then I guess I need to continue exploring 
different options. Thanks!!

Cheers,

-Gaiko
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] deduplication requirements

2011-02-07 Thread Michael
Hi guys,

I'm currently running 2 zpools, each in a raidz1 configuration, totalling
around 16TB of usable data. I'm running it all on an OpenSolaris-based box with
2GB of memory and an old Athlon 64 3700 CPU. I understand this is very poor and
underpowered for deduplication, so I'm looking at building a new system, but
wanted some advice first. Here is what I've planned so far:

Core i7 2600 CPU
16gb DDR3 Memory
64GB SSD for ZIL (optional)

Would this produce decent results for deduplication of 16TB worth of pools
or would I need more RAM still?
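
A back-of-the-envelope sketch of the dedup table (DDT) size, assuming roughly
320 bytes per unique block and the default 128K recordsize (both assumptions;
the real numbers depend on your data, and zdb -DD <pool> will show the actual
table once dedup is enabled):

    # unique 128K blocks in 16TB, times ~320 bytes per DDT entry, in GB:
    echo '(16 * 2^40 / (128 * 2^10)) * 320 / 2^30' | bc
    # => 40

So you are looking at tens of GB of DDT if most blocks are unique; with 16GB
of RAM you will probably want a large L2ARC SSD behind it to hold the rest of
the table.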
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] 40MB repaired on a disk during scrub but no errors

2011-02-07 Thread Peter Tripp
Hey folks,

While scrubbing, zpool status shows nearly 40MB repaired but 0 in each of the 
read/write/checksum columns for each disk.  One disk has (repairing) to the 
right but once the scrub completes there's no mention that anything ever needed 
fixing.  Any idea what would need to be repaired on that disk? Are there any 
other types of errors besides read/write/checksum?  Previously, whenever a disk 
has required repair during scrub it's been either bad disk or loose cable 
connection and it's generated read, write and/or cksum errors.  It also irks me 
a little that these repairs are only noted while the scrub is running. Once 
it's complete, it's as if those repairs never happened.

If it's relevant, this is a 6 drive mirrored pool with a single SSD for L2Arc 
cache. Pool version 26 under Nexenta Core Platform 3.0 with a LSI 9200-16E and 
SATA disks.

$ zpool status bigboy
  pool: bigboy
 state: ONLINE
 scan: scrub in progress since Sat Feb  5 02:22:18 2011
3.74T scanned out of 3.74T at 141M/s, 0h0m to go
37.9M repaired, 99.88% done
[-config snip - all columns 0, one drive on the right has (repairing)]
errors: No known data errors

And then once the scrub completes:

$ zpool status bigboy
  pool: bigboy
 state: ONLINE
 scan: scrub repaired 37.9M in 7h42m with 0 errors on Sat Feb  5 10:04:53 2011
[-config snip - all columns 0, the (repairing) note is now gone]
errors: No known data errors
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread Gaikokujin Kyofusho
Thank you kebabber. I will try out indiana and virtual box to play around with 
it a bit.

Just to make sure I understand your example: if I had, say, 4x2TB drives, 
2x750GB and 2x1.5TB drives, then I could make 3 groups (perhaps 1 raidz1 + 2 
mirrors). In terms of accessing them, would they just be mounted like 3 
partitions, or could it all be accessed like one big partition?

Anywho, I have Indiana DL'ing now (very slow connection, so I thought I would 
post while I wait).

Cheers,

-Gaiko
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Roch

On 7 Feb 2011, at 06:25, Richard Elling wrote:

 On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:
 
 Hi all,
 
 I'm trying to achieve the same effect as UFS directio on ZFS and here
 is what I did:
 
 Solaris UFS directio has three functions:
   1. improved async code path
   2. multiple concurrent writers
   3. no buffering
 
 Of the three, #1 and #2 were designed into ZFS from day 1, so there is nothing
 to set or change to take advantage of the feature.
 
 
 1. Set the primarycache of zfs to metadata and secondarycache to none,
 recordsize to 8K (to match the unit size of writes)
 2. Run my test program (code below) with different options and measure
 the running time.
 a) open the file without O_DSYNC flag: 0.11s.
 This doesn't seem like directio is in effect, because I tried on UFS
 and time was 2s. So I went on with more experiments with the O_DSYNC
 flag set. I know that directio and O_DSYNC are two different things,
 but I thought the flag would force synchronous writes and achieve what
 directio does (and more).
 
 Directio and O_DSYNC are two different features.
 
 b) open the file with O_DSYNC flag: 147.26s
 
 ouch

How big a file?
Does the result hold if you don't truncate?

-r

 
 c) same as b) but also enabled zfs_nocacheflush: 5.87s
 
 Is your pool created from a single HDD?
 
 My questions are:
 1. With my primarycache and secondarycache settings, the FS shouldn't
 buffer reads and writes anymore. Wouldn't that be equivalent to
 O_DSYNC? Why a) and b) are so different?
 
 No. O_DSYNC deals with when the I/O is committed to media.
 
 2. My understanding is that zfs_nocacheflush essentially removes the
 sync command sent to the device, which cancels the O_DSYNC flag. Why
 b) and c) are so different?
 
 No. Disabling the cache flush means that the volatile write buffer in the 
 disk is not flushed. In other words, disabling the cache flush is in direct
 conflict with the semantics of O_DSYNC.
 
 3. Does ZIL have anything to do with these results?
 
 Yes. The ZIL is used for meeting the O_DSYNC requirements.  This has
 nothing to do with buffering. More details are on the ZFS Best Practices 
 Guide.
 -- richard
 
 
 Thanks in advance for any suggestion/insight!
 Yi
 
 
 #include <fcntl.h>
 #include <sys/time.h>
 #include <stdio.h>      /* printf */
 #include <unistd.h>     /* pwrite, close */

 int main(int argc, char **argv)
 {
   struct timeval tim;
   gettimeofday(&tim, NULL);
   double t1 = tim.tv_sec + tim.tv_usec/1000000.0;
   char a[8192];
   int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
   //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
   if (argv[2][0] == '1')
       directio(fd, DIRECTIO_ON);    /* Solaris directio(3C) hint */
   int i;
   for (i = 0; i < 10000; ++i)       /* 10000 x 8K writes = ~80MB */
       pwrite(fd, a, sizeof(a), i*8192);
   close(fd);
   gettimeofday(&tim, NULL);
   double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
   printf("%f\n", t2-t1);
 }
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replace block devices to increase pool size

2011-02-07 Thread David Dyer-Bennet

On Sun, February 6, 2011 08:41, Achim Wolpers wrote:

 I have a zpool biult up from two vdrives (one mirror and one raidz). The
 raidz is built up from 4x1TB HDs. When I successively replace each 1TB
 drive with a 2TB drive will the capacity of the raidz double after the
 last block device is replaced?

You may have to manually set property autoexpand=on; I found yesterday
that I had to (in my case on a mirror that I was upgrading).  Probably
depends on what version you created things at and/or what version you're
running now.

I replaced the drives in one of the three mirror vdevs in my main pool
over this last weekend, and it all went quite smoothly, but I did have to
turn on autoexpand at the end of the process to see the new space.
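
For reference, the commands involved look roughly like this (pool and device
names are placeholders; zpool online -e is only needed if the pool doesn't
pick up the new space on its own):

    zpool set autoexpand=on tank
    zpool online -e tank c0t3d0   # after the last replacement has resilvered
    zpool list tank               # the extra capacity should now show up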
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 13

2011-02-07 Thread David Dyer-Bennet

On Sun, February 6, 2011 13:01, Michael Armstrong wrote:
 Additionally, the way I do it is to draw a diagram of the drives in the
 system, labelled with the drive serial numbers. Then when a drive fails, I
 can find out from smartctl which drive it is and remove/replace without
 trial and error.

Having managed to muddle through this weekend without loss (though with a
certain amount of angst and duplication of efforts), I'm in the mood to
label things a bit more clearly on my system :-).

smartctl doesn't seem to be on my system, though.  I'm running
snv_134.  I'm still pretty badly lost in the whole repository /
package thing with Solaris, most of my brain cells were already
occupied with Red Hat, Debian, and Perl package information :-( .
Where do I look?

Are the controller port IDs, the C9T3D0 things that ZFS likes,
reasonably stable?  They won't change just because I add or remove
drives, right; only maybe if I change controller cards?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 10:26 AM, Roch roch.bourbonn...@oracle.com wrote:

 On 7 Feb 2011, at 06:25, Richard Elling wrote:

 On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:

 Hi all,

 I'm trying to achieve the same effect as UFS directio on ZFS and here
 is what I did:

 Solaris UFS directio has three functions:
       1. improved async code path
       2. multiple concurrent writers
       3. no buffering

 Of the three, #1 and #2 were designed into ZFS from day 1, so there is 
 nothing
 to set or change to take advantage of the feature.


 1. Set the primarycache of zfs to metadata and secondarycache to none,
 recordsize to 8K (to match the unit size of writes)
 2. Run my test program (code below) with different options and measure
 the running time.
 a) open the file without O_DSYNC flag: 0.11s.
 This doesn't seem like directio is in effect, because I tried on UFS
 and time was 2s. So I went on with more experiments with the O_DSYNC
 flag set. I know that directio and O_DSYNC are two different things,
 but I thought the flag would force synchronous writes and achieve what
 directio does (and more).

 Directio and O_DSYNC are two different features.

 b) open the file with O_DSYNC flag: 147.26s

 ouch

 How big a file?
 Does the result hold if you don't truncate?

 -r

The file is 8K*10000, about 80M. I removed the O_TRUNC flag and the
results stayed the same...


 c) same as b) but also enabled zfs_nocacheflush: 5.87s

 Is your pool created from a single HDD?

 My questions are:
 1. With my primarycache and secondarycache settings, the FS shouldn't
 buffer reads and writes anymore. Wouldn't that be equivalent to
 O_DSYNC? Why a) and b) are so different?

 No. O_DSYNC deals with when the I/O is committed to media.

 2. My understanding is that zfs_nocacheflush essentially removes the
 sync command sent to the device, which cancels the O_DSYNC flag. Why
 b) and c) are so different?

 No. Disabling the cache flush means that the volatile write buffer in the
 disk is not flushed. In other words, disabling the cache flush is in direct
 conflict with the semantics of O_DSYNC.

 3. Does ZIL have anything to do with these results?

 Yes. The ZIL is used for meeting the O_DSYNC requirements.  This has
 nothing to do with buffering. More details are on the ZFS Best Practices 
 Guide.
 -- richard


 Thanks in advance for any suggestion/insight!
 Yi


 #include <fcntl.h>
 #include <sys/time.h>
 #include <stdio.h>      /* printf */
 #include <unistd.h>     /* pwrite, close */

 int main(int argc, char **argv)
 {
   struct timeval tim;
   gettimeofday(&tim, NULL);
   double t1 = tim.tv_sec + tim.tv_usec/1000000.0;
   char a[8192];
   int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
   //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
   if (argv[2][0] == '1')
       directio(fd, DIRECTIO_ON);    /* Solaris directio(3C) hint */
   int i;
   for (i = 0; i < 10000; ++i)       /* 10000 x 8K writes = ~80MB */
       pwrite(fd, a, sizeof(a), i*8192);
   close(fd);
   gettimeofday(&tim, NULL);
   double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
   printf("%f\n", t2-t1);
 }
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread Deano
In zfs terminology each of the groups you have is a VDEV and a zpool can be
made of a number of VDEVs. This zpool can then be mounted as a single
filesystem, or you can split it into as many filesystems as you wish.

So the answer is yes to all the configurations you asked about and a lot
more :)
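
As a rough sketch (device names are placeholders, and note that zpool will
warn about mixing raidz and mirror vdevs in one pool and wants -f to accept
it):

    zpool create -f tank \
        raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
        mirror c2t0d0 c2t1d0 \
        mirror c2t2d0 c2t3d0
    zfs create tank/media   # carve out as many filesystems as you like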

Bye,
Deano
de...@cloudpixies.com

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Gaikokujin
Kyofusho
Sent: 05 February 2011 17:55
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

Thank you kebabber. I will try out indiana and virtual box to play around
with it a bit.

Just to make sure I understand your example: if I had, say, 4x2TB drives,
2x750GB and 2x1.5TB drives, then I could make 3 groups (perhaps 1 raidz1 + 2
mirrors). In terms of accessing them, would they just be mounted like 3
partitions, or could it all be accessed like one big partition?

Anywho, I have indiana DL'ing now (very slow connection so thought I would
post while i wait).

Cheers,

-Gaiko
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread Brandon High
On Sat, Feb 5, 2011 at 9:54 AM, Gaikokujin Kyofusho
gaikokujinkyofu...@gmail.com wrote:
 Just to make sure I understand your example: if I had, say, 4x2TB drives, 
 2x750GB and 2x1.5TB drives, then I could make 3 groups (perhaps 1 raidz1 + 2 
 mirrors). In terms of accessing them, would they just be mounted like 3 
 partitions, or could it all be accessed like one big partition?

You could add them to one pool, and then create multiple filesystems
inside the pool. Your total storage would be the sum of the drives'
capacity after redundancy, or 3x2TB + 750GB + 1.5TB.

It's not recommended to use different levels of redundancy in a pool,
so you may want to consider using mirrors for everything. This also
makes it easier to add or upgrade capacity later.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Brandon High
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 Solaris UFS directio has three functions:
        1. improved async code path
        2. multiple concurrent writers
        3. no buffering

 Thanks for the comments, Richard. All I want is to achieve #3 on ZFS.
 But as I said, apparently 2.a) below didn't give me that. Do you have
 any suggestions?

Don't. Use a ZIL, which will meet the requirements for synchronous IO.
Set primarycache to metadata to prevent caching reads.

ZFS is a very different beast than UFS and doesn't require the same tuning.
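
For completeness, the settings mentioned in this thread as zfs(1M) commands
(the dataset name is a placeholder):

    zfs set primarycache=metadata tank/bench   # ARC keeps metadata only
    zfs set secondarycache=none tank/bench     # keep the L2ARC out of it
    zfs set recordsize=8K tank/bench           # match the 8K write size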

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 21

2011-02-07 Thread Michael Armstrong
I obtained smartmontools (which includes smartctl) from the standard apt 
repository (I'm using Nexenta, however). In addition, it's necessary to use the 
device type sat,12 with smartctl to get it to read attributes correctly on 
OpenSolaris, AFAIK. Also, regarding device IDs on the system: from what I've 
seen they are assigned to ports and therefore do not change; however, they will 
most likely change if you change the controller, unless it's the same chipset 
with exactly the same port configuration. Hope this helps.
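
For example, once smartmontools is installed, something along these lines
(the device path is a placeholder; whether the package is in your repo depends
on the distro -- on snv_134 you may have to build it from source or pull it
from a third-party repo):

    smartctl -d sat,12 -a /dev/rdsk/c9t3d0s0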

On 7 Feb 2011, at 18:04, zfs-discuss-requ...@opensolaris.org wrote:

 Having managed to muddle through this weekend without loss (though with a
 certain amount of angst and duplication of efforts), I'm in the mood to
 label things a bit more clearly on my system :-).
 
 smartctl doesn't seem to be on my system, though.  I'm running
 snv_134.  I'm still pretty badly lost in the whole repository /
 package thing with Solaris, most of my brain cells were already
 occupied with Red Hat, Debian, and Perl package information :-( .
 Where do I look?
 
 Are the controller port IDs, the C9T3D0 things that ZFS likes,
 reasonably stable?  They won't change just because I add or remove
 drives, right; only maybe if I change controller cards?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 1:06 PM, Brandon High bh...@freaks.com wrote:
 On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling
 richard.ell...@gmail.com wrote:
 Solaris UFS directio has three functions:
        1. improved async code path
        2. multiple concurrent writers
        3. no buffering

 Thanks for the comments, Richard. All I want is to achieve #3 on ZFS.
 But as I said, apparently 2.a) below didn't give me that. Do you have
 any suggestions?

 Don't. Use a ZIL, which will meet the requirements for synchronous IO.
 Set primarycache to metadata to prevent caching reads.

 ZFS is a very different beast than UFS and doesn't require the same tuning.

I already set primarycache to metadata, and I'm not concerned about
caching reads, but caching writes. It appears writes are indeed cached
judging from the time of 2.a) compared to UFS+directio. More
specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while
80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Brandon High
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang yizhan...@gmail.com wrote:
 I already set primarycache to metadata, and I'm not concerned about
 caching reads, but caching writes. It appears writes are indeed cached
 judging from the time of 2.a) compared to UFS+directio. More
 specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while
 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't.

You're trying to force a solution that isn't relevant for the
situation. ZFS is not UFS, and solutions that are required for UFS to
work correctly are not needed with ZFS.

Yes, writes are cached, but all the POSIX requirements for synchronous
IO are met by the ZIL. As long as your storage devices, be they SAN,
DAS or somewhere in between respect cache flushes, you're fine. If you
need more performance, use a slog device that respects cache flushes.
You don't need to worry about whether writes are being cached, because
any data that is written synchronously will be committed to stable
storage before the write returns.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sector size on 7K3000 drives?

2011-02-07 Thread Roy Sigurd Karlsbakk
Hi all,

Does anyone here know whether the new 7K3000 drives from Hitachi use 4k 
sectors or not? The docs say "Sector size (variable, Bytes/sector): 512", but 
since it's variable, any idea what it might be? I'm planning to replace 7x3+1 
drives on this system to try to get some free space on some full VDEVs. If the 
drives are in fact 4k sector drives, will it be possible to remedy the 
performance penalty now that we have already used about 99% of the original 
drives?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote:
 On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang yizhan...@gmail.com wrote:
 I already set primarycache to metadata, and I'm not concerned about
 caching reads, but caching writes. It appears writes are indeed cached
 judging from the time of 2.a) compared to UFS+directio. More
 specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while
 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't.

 You're trying to force a solution that isn't relevant for the
 situation. ZFS is not UFS, and solutions that are required for UFS to
 work correctly are not needed with ZFS.

 Yes, writes are cached, but all the POSIX requirements for synchronous
 IO are met by the ZIL. As long as your storage devices, be they SAN,
 DAS or somewhere in between respect cache flushes, you're fine. If you
 need more performance, use a slog device that respects cache flushes.
 You don't need to worry about whether writes are being cached, because
 any data that is written synchronously will be committed to stable
 storage before the write returns.

 -B

 --
 Brandon High : bh...@freaks.com

Maybe I didn't make my intention clear. UFS with directio is
reasonably close to a raw disk from my application's perspective: when
the app writes to a file location, no buffering happens. My goal is to
find a way to duplicate this on ZFS.

Setting primarycache didn't eliminate the buffering, and using O_DSYNC
(whose side effects include elimination of buffering) made it
ridiculously slow: none of the things I tried eliminated buffering,
and just buffering, on ZFS.

From the discussion so far my feeling is that ZFS is so different
from UFS that there's simply no way to achieve this goal...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Roy Sigurd Karlsbakk
 Maybe I didn't make my intention clear. UFS with directio is
 reasonably close to a raw disk from my application's perspective: when
 the app writes to a file location, no buffering happens. My goal is to
 find a way to duplicate this on ZFS.

There really is no need to do this on ZFS. Using an SLOG device (ZIL on an SSD) 
will allow ZFS to do its caching transparently to the application. Successive 
read operations will read from the cache if that's available, and writes will go 
to the SLOG _and_ the ARC for successive commits. As long as the SLOG device 
supports cache flush, or has a supercap/BBU, your data will be safe.

 Setting primarycache didn't eliminate the buffering, using O_DSYNC
 (whose side effects include elimination of buffering) made it
 ridiculously slow: none of the things I tried eliminated buffering,
 and just buffering, on ZFS.
 
 From the discussion so far my feeling is that ZFS is too different
 from UFS that there's simply no way to achieve this goal...

See above - ZFS is quite safe to use for this, given a good hardware 
configuration.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replace block devices to increase pool size

2011-02-07 Thread Roy Sigurd Karlsbakk
  I have a zpool biult up from two vdrives (one mirror and one raidz).
  The
  raidz is built up from 4x1TB HDs. When I successively replace each
  1TB
  drive with a 2TB drive will the capacity of the raidz double after
  the
  last block device is replaced?
 
 You may have to manually set property autoexpand=on; I found yesterday
 that I had to (in my case on a mirror that I was upgrading). Probably
 depends on what version you created things at and/or what version
 you're
 running now.
 
 I replaced the drives in one of the three mirror vdevs in my main pool
 over this last weekend, and it all went quite smoothly, but I did have
 to
 turn on autoexpand at the end of the process to see the new space.

autoexpand is off by default. I guess this is in case someone does something 
like replacing two 500GB drives with two 1TB drives and then wants to replace 
those with new 500GB drives again, since setting autoexpand=on is quite simple 
but the expansion is irreversible. Once you have expanded a VDEV, the only way 
to shrink it is to back up, reconfigure ZFS and restore.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sector size on 7K3000 drives?

2011-02-07 Thread David Magda
On Mon, February 7, 2011 14:12, Roy Sigurd Karlsbakk wrote:
 Hi al

 Does anyone here that knows if the new 7K3000 drives from Hitachi uses 4k
 sectors or not? The docs say Sector size (variable, Bytes/sector): 512,
 but since it's variable, any idea what it might be? I'm planning to
[...]

This PDF data sheet for SATA says 512, without the word variable:

http://tinyurl.com/4b6qgtc
http://www.hitachigst.com/tech/techlib.nsf/techdocs/EC6D440C3F64DBCC8825782300026498/$file/US7K3000_ds.pdf

Of course they could mean what's reported to the OS (the SAS models say
512 / 520 / 528).

I'd e-mail or call Hitachi and ask them directly (the contact info is in
the PDF).

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 2:21 PM, Brandon High bh...@freaks.com wrote:
 On Mon, Feb 7, 2011 at 11:17 AM, Yi Zhang yizhan...@gmail.com wrote:
 Maybe I didn't make my intention clear. UFS with directio is
 reasonably close to a raw disk from my application's perspective: when
 the app writes to a file location, no buffering happens. My goal is to
 find a way to duplicate this on ZFS.

 Step back an consider *why* you need no buffering.
I'm writing a database-like application which manages its own page
buffer, so I want to disable the buffering at the OS/FS level. UFS
with directio suits my need perfectly, but I also want to try it on
ZFS, because ZFS doesn't directly overwrite a page which is being
modified (it allocates a new page instead), and thus it represents a
different category of FS. I want to measure the performance difference
of my app on UFS and ZFS and see how FS-dependent my app is.

 From the discussion so far my feeling is that ZFS is too different
 from UFS that there's simply no way to achieve this goal...

 ZFS is not UFS, and solutions that are required for UFS to work
 correctly are not needed with ZFS.

 -B

 --
 Brandon High : bh...@freaks.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Nico Williams
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote:
 Maybe I didn't make my intention clear. UFS with directio is
 reasonably close to a raw disk from my application's perspective: when
 the app writes to a file location, no buffering happens. My goal is to
 find a way to duplicate this on ZFS.

You're still mixing directio and O_DSYNC.

O_DSYNC is like calling fsync(2) after every write(2).  fsync(2) is
useful to obtain some limited transactional semantics, as well as durability
semantics.  In ZFS you don't need to call fsync(2) to get those transactional
semantics, but you do need to call fsync(2) to get those durability semantics.

Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly
more than just the data blocks you wrote to.  Which means that O_DSYNC on ZFS
is significantly slower than on UFS.  You can address this in one of two ways:
a) you might realize that you don't need every write(2) to be durable, then stop
using O_DSYNC, b) you might get a fast ZIL device.
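
For what it's worth, option (b) is a one-liner if you have a suitable SSD to
spare (pool and device names are placeholders):

    zpool add tank log c3t0d0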

I'm betting that if you look carefully at your application's requirements you'll
probably conclude that you don't need O_DSYNC at all.  Perhaps you can tell us
more about your application.

 Setting primarycache didn't eliminate the buffering, using O_DSYNC
 (whose side effects include elimination of buffering) made it
 ridiculously slow: none of the things I tried eliminated buffering,
 and just buffering, on ZFS.

 From the discussion so far my feeling is that ZFS is too different
 from UFS that there's simply no way to achieve this goal...

You've not really stated your application's requirements.  You may be convinced
that you need O_DSYNC, but chances are that you don't.  And yes, it's possible
that you'd need O_DSYNC on UFS but not on ZFS.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 2:42 PM, Nico Williams n...@cryptonector.com wrote:
 On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote:
 Maybe I didn't make my intention clear. UFS with directio is
 reasonably close to a raw disk from my application's perspective: when
 the app writes to a file location, no buffering happens. My goal is to
 find a way to duplicate this on ZFS.

 You're still mixing directio and O_DSYNC.

 O_DSYNC is like calling fsync(2) after every write(2).  fsync(2) is
 useful to obtain
 some limited transactional semantics, as well as for durability
 semantics.  In ZFS
 you don't need to call fsync(2) to get those transactional semantics, but you 
 do
 need to call fsync(2) get those durability semantics.

 Now, in ZFS fsync(2) implies a synchronous I/O operation involving 
 significantly
 more than just the data blocks you wrote to.  Which means that O_DSYNC on ZFS
 is significantly slower than on UFS.  You can address this in one of two ways:
 a) you might realize that you don't need every write(2) to be durable, then 
 stop
 using O_DSYNC, b) you might get a fast ZIL device.

 I'm betting that if you look carefully at your application's requirements 
 you'll
 probably conclude that you don't need O_DSYNC at all.  Perhaps you can tell us
 more about your application.

 Setting primarycache didn't eliminate the buffering, using O_DSYNC
 (whose side effects include elimination of buffering) made it
 ridiculously slow: none of the things I tried eliminated buffering,
 and just buffering, on ZFS.

 From the discussion so far my feeling is that ZFS is too different
 from UFS that there's simply no way to achieve this goal...

 You've not really stated your application's requirements.  You may be 
 convinced
 that you need O_DSYNC, but chances are that you don't.  And yes, it's possible
 that you'd need O_DSYNC on UFS but not on ZFS.

 Nico
 --

Please see my previous email for a high-level discussion of my
application. I know that I don't really need O_DSYNC. The reason why I
tried that is to get the side effect of no buffering, which is my
ultimate goal.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-07 Thread Marion Hakanson
matt.connolly...@gmail.com said:
                     extended device statistics
     r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
     1.2   36.0  153.6  4608.0  1.2  0.3   31.9    9.3  16  18 c12d0
     0.0  113.4    0.0  7446.7  0.8  0.1    7.0    0.5  15   5 c8t0d0
     0.2  106.4    4.1  7427.8  4.0  0.1   37.8    1.4  93  14 c8t1d0
                     extended device statistics
     r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
     0.4   73.2   25.7  9243.0  2.3  0.7   31.6    9.8  34  37 c12d0
     0.0  226.6    0.0 24860.5  1.6  0.2    7.0    0.9  25  19 c8t0d0
     0.2  127.6    3.4 12377.6  3.8  0.3   29.7    2.2  91  27 c8t1d0
                     extended device statistics
     r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0   44.2    0.0  5657.6  1.4  0.4   31.7    9.0  19  20 c12d0
     0.2   76.0    4.8  9420.8  1.1  0.1   14.2    1.7  12  13 c8t0d0
     0.0   16.6    0.0  2058.4  9.0  1.0  542.1   60.2 100 100 c8t1d0
                     extended device statistics
     r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0    0.2    0.0    25.6  0.0  0.0    0.3    2.3   0   0 c12d0
     0.0    0.0    0.0     0.0  0.0  0.0    0.0    0.0   0   0 c8t0d0
     0.0   11.0    0.0  1365.6  9.0  1.0  818.1   90.9 100 100 c8t1d0
 . . .

matt.connolly...@gmail.com said:
 I expect that the c8t0d0 WD Green is the lemon here and for some reason is
 getting stuck in periods where it can write no faster than about 2MB/s. Does
 this sound right? 

No, it's the opposite.  The drive sitting at 100%-busy, c8t1d0, while the
other drive is idle, is the sick one.  It's slower than the other, has 9.0
operations waiting (queued) to finish.  The other one is idle because it
has already finished the write activity and is waiting for the slow one
in the mirror to catch up.  If you run iostat -xn without the interval
argument, i.e. so it prints out only one set of stats, you'll see the
average performance of the drives since last reboot.  If the asvc_t
figure is significantly larger for one drive than the other, that's a
way to identify the one which has been slower over the long term.
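
For example, something like this to restrict the output to just the two
sides of the mirror:

    iostat -xn c8t0d0 c8t1d0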


 Secondly, what I wonder is why it is that the whole file system seems to hang
 up at this time. Surely if the other drive is doing nothing, a web page can
 be served by reading from the available drive (c8t1d0) while the slow drive
 (c8t0d0) is stuck writing slow. 

The available drive is c8t0d0 in this case.  However, if ZFS is in the
middle of a txg (ZFS transaction) commit, it cannot safely do much with
the pool until that commit finishes.  You can see that ZFS only lets 10
operations accumulate per drive (used to be 35), i.e. 9.0 in the wait
column, and 1.0 in the actv column, so it's kinda stuck until the
drive gets its work done.

Maybe the drive is failing, or maybe it's one of those with large sectors
that are not properly aligned with the on-disk partitions.
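
A quick way to check the second possibility (the slice name is a placeholder):
look at the starting sector of the slice the pool lives on; for a 4KB-sector
drive it should be divisible by 8 so that the slice sits on a 4KB boundary:

    prtvtoc /dev/rdsk/c8t1d0s0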

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Roch

On 7 Feb 2011, at 17:08, Yi Zhang wrote:

 On Mon, Feb 7, 2011 at 10:26 AM, Roch roch.bourbonn...@oracle.com wrote:
 
 On 7 Feb 2011, at 06:25, Richard Elling wrote:
 
 On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:
 
 Hi all,
 
 I'm trying to achieve the same effect as UFS directio on ZFS and here
 is what I did:
 
 Solaris UFS directio has three functions:
   1. improved async code path
   2. multiple concurrent writers
   3. no buffering
 
 Of the three, #1 and #2 were designed into ZFS from day 1, so there is 
 nothing
 to set or change to take advantage of the feature.
 
 
 1. Set the primarycache of zfs to metadata and secondarycache to none,
 recordsize to 8K (to match the unit size of writes)
 2. Run my test program (code below) with different options and measure
 the running time.
 a) open the file without O_DSYNC flag: 0.11s.
 This doesn't seem like directio is in effect, because I tried on UFS
 and time was 2s. So I went on with more experiments with the O_DSYNC
 flag set. I know that directio and O_DSYNC are two different things,
 but I thought the flag would force synchronous writes and achieve what
 directio does (and more).
 
 Directio and O_DSYNC are two different features.
 
 b) open the file with O_DSYNC flag: 147.26s
 
 ouch
 
 How big a file?
 Does the result hold if you don't truncate?

OK, if it had been a 2TB file, I could have seen an opening. Not for 80M though.
So it's baffling... unless!

It's not just the open which takes 147s, it's the whole run: 10000 writes.
10000 sync writes without an SSD would take about 150 seconds at 68 IO/s.

Without the O_DSYNC flag all writes go to memory, so it's expected to take 
0.11s to transfer the 8K writes at 750MB/sec (memcpy speed).

O_DSYNC + zfs_nocacheflush is in between. Every write transfers data to an 
unstable cache but then does not flush it.
At some point the cache might overflow, and so some writes have high latency 
while the data is transferring from disk cache to disk platter.

So those results are in line with what everybody has been seeing before.

Note that to compare with UFS: since UFS does not issue a cache flush after 
every sync write, as ZFS correctly does, you have to compare UFS + write cache 
disabled to ZFS (with or without write cache). 

After deleting a zfs pool, the disk write cache is left on, and so a UFS 
filesystem will then appear inordinately fast unless you turn off the write 
cache with format -e; cache; write_cache; disable.

-r


 
 -r
 
 The file is 8K*1 about 80M. I removed the O_TRUNC flag and the
 results stayed the same...
 
 
 c) same as b) but also enabled zfs_nocacheflush: 5.87s
 
 Is your pool created from a single HDD?
 
 My questions are:
 1. With my primarycache and secondarycache settings, the FS shouldn't
 buffer reads and writes anymore. Wouldn't that be equivalent to
 O_DSYNC? Why a) and b) are so different?
 
 No. O_DSYNC deals with when the I/O is committed to media.
 
 2. My understanding is that zfs_nocacheflush essentially removes the
 sync command sent to the device, which cancels the O_DSYNC flag. Why
 b) and c) are so different?
 
 No. Disabling the cache flush means that the volatile write buffer in the
 disk is not flushed. In other words, disabling the cache flush is in direct
 conflict with the semantics of O_DSYNC.
 
 3. Does ZIL have anything to do with these results?
 
 Yes. The ZIL is used for meeting the O_DSYNC requirements.  This has
 nothing to do with buffering. More details are on the ZFS Best Practices 
 Guide.
 -- richard
 
 
 Thanks in advance for any suggestion/insight!
 Yi
 
 
 #include <fcntl.h>
 #include <sys/time.h>
 #include <stdio.h>      /* printf */
 #include <unistd.h>     /* pwrite, close */

 int main(int argc, char **argv)
 {
   struct timeval tim;
   gettimeofday(&tim, NULL);
   double t1 = tim.tv_sec + tim.tv_usec/1000000.0;
   char a[8192];
   int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
   //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
   if (argv[2][0] == '1')
       directio(fd, DIRECTIO_ON);    /* Solaris directio(3C) hint */
   int i;
   for (i = 0; i < 10000; ++i)       /* 10000 x 8K writes = ~80MB */
       pwrite(fd, a, sizeof(a), i*8192);
   close(fd);
   gettimeofday(&tim, NULL);
   double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
   printf("%f\n", t2-t1);
 }
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Bill Sommerfeld

On 02/07/11 11:49, Yi Zhang wrote:

The reason why I
tried that is to get the side effect of no buffering, which is my
ultimate goal.


ultimate = final.  you must have a goal beyond the elimination of 
buffering in the filesystem.


if the writes are made durable by zfs when you need them to be durable, 
why does it matter that it may buffer data while it is doing so?


- Bill
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 2:54 PM, Nico Williams n...@cryptonector.com wrote:
 On Mon, Feb 7, 2011 at 1:49 PM, Yi Zhang yizhan...@gmail.com wrote:
 Please see my previous email for a high-level discussion of my
 application. I know that I don't really need O_DSYNC. The reason why I
 tried that is to get the side effect of no buffering, which is my
 ultimate goal.

 ZFS cannot not buffer.  The reason is that ZFS likes to batch transactions 
 into
 as large a contiguous write to disk as possible.  The ZIL exists to
 support fsync(2)
 operations that must commit before the rest of a ZFS transaction.  In
 other words:
 there's always some amount of buffering of writes in ZFS.
In that case, ZFS doesn't suit my needs.

 As to read buffering, why would you want to disable those?
My application manages its own buffer and reads/writes go through that
buffer first. I don't want double buffering.

 You still haven't told us what your application does.  Or why you want
 to get close
 to the metal.  Simply telling us that you need no buffering doesn't
 really help us
 help you -- with that approach you'll simply end up believing that ZFS is not
 appropriate for your needs, even though it well might be.
It's like Berkeley DB at a high level, though it doesn't require
transaction support, durability, etc. I'm measuring its performance
and don't want the FS buffer to pollute my results (hence directio).

 Nico
 --

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld sommerf...@alum.mit.edu wrote:
 On 02/07/11 11:49, Yi Zhang wrote:

 The reason why I
 tried that is to get the side effect of no buffering, which is my
 ultimate goal.

 ultimate = final.  you must have a goal beyond the elimination of
 buffering in the filesystem.

 if the writes are made durable by zfs when you need them to be durable, why
 does it matter that it may buffer data while it is doing so?

                                                - Bill

If buffering is on, the running time of my app doesn't reflect the
actual I/O cost. My goal is to accurately measure the time of I/O.
With buffering on, ZFS would batch up a bunch of writes and change
both the original I/O activity and the time.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-07 Thread Matt Connolly
Thanks, Marion.

(I actually got the drive labels mixed up in the original post... I edited it 
on the forum page: 
http://opensolaris.org/jive/thread.jspa?messageID=511057#511057 )

My suspicion was the same: the drive doing the slow i/o is the problem.

I managed to confirm that by taking the other drive offline (c8t0d0 samsung), 
and the same stalls and slow i/o occurred.

After putting the drive online (and letting the resilver complete) I took the 
slow drive (c8t1d0 western digital green) offline and the system ran very 
nicely.

It is a 4k sector drive, but I thought zfs recognised those drives and didn't 
need any special configuration...?
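
One thing worth checking is what alignment ZFS chose when the vdev was
created; something like the following (a hedged suggestion -- output format
varies by build) should show ashift=9 for 512-byte alignment or ashift=12
for 4KB:

    zdb -C rpool | grep ashift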
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Newbie question

2011-02-07 Thread David Dyer-Bennet

On Sat, February 5, 2011 03:54, Gaikokujin Kyofusho wrote:

 From what I understand using ZFS one could setup something like RAID 6
 (RAID-Z2?) but with the ability to use drives of varying
 sizes/speeds/brands and able to add additional drives later. Am I about
 right? If so I will continue studying up on this if not then I guess I
 need to continue exploring different options. Thanks!!

IMHO, your best bet for this kind of configuration is to use mirror pairs,
not RAIDZ*.  Because...

Things you can't do with RAIDZ*:

You cannot remove a vdev from a pool.

You cannot make a RAIDZ* vdev smaller (fewer disks).

You cannot make a RAIDZ* vdev larger (more disks).

To increase the storage capacity of a RAIDZ* vdev you need to replace all
the drives, one at a time, waiting for resilver between replacements
(resilver times can be VERY long with big modern drives).  And during each
resilver, your redundancy will be reduced by 1 -- meaning a RAIDZ array
would have NO redundancy during the resilver.  (And activity in the pool
is high during the resilver -- meaning the chances of any marginal drive
crapping out are higher than normal during the resilver.)

With mirrors, you can add new space by adding simply two drives (add a new
mirror vdev).

You can upgrade an existing mirror by replacing only two drives.

You can upgrade an existing mirror without reducing redundancy below your
starting point ever -- you attach a new drive, wait for the resilver to
complete (at this point you have a three-way mirror), then detach one of
the original drives; repeat for another new drive and the other original
drive.
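
In zpool terms that procedure is roughly the following (a sketch; the pool
name tank and the device names are placeholders, not from this thread):

  zpool attach tank c1t0d0 c2t0d0   # add the new disk; the mirror is 3-way while it resilvers
  zpool status tank                 # wait until the resilver is reported complete
  zpool detach tank c1t0d0          # then drop one of the original disks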

Obviously, using mirrors requires you to buy more drives for any given
amount of usable space.

I must admit that my 8-bay hot-swap ZFS server cost me a LOT more than a
Drobo (but then I bought in 2006, too).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread David Dyer-Bennet

On Sat, February 5, 2011 11:54, Gaikokujin Kyofusho wrote:
 Thank you kebabber. I will try out indiana and virtual box to play around
 with it a bit.

 Just to make sure I understand your example: if I, say, had 4x 2TB drives,
 2x 750GB, and 2x 1.5TB drives, then I could make 3 groups (perhaps 1 raidz1 +
 1 mirrored + 1 mirrored). In terms of accessing them, would they just be
 mounted like 3 partitions, or could it all be accessed like one big
 partition?

A ZFS pool can contain many vdevs; you could put the three groups you
describe into one pool, and then assign one (or more) file-systems to that
pool.  Putting them all in one pool seems to me the natural way to handle
it; they're all similar levels of redundancy.  It's more flexible to have
everything in one pool, generally.
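
As a rough sketch (the pool and device names here are made up, not from the
poster's setup), mixing vdev types in a single pool looks like:

  zpool create tank \
      raidz  c0t0d0 c0t1d0 c0t2d0 c0t3d0 \
      mirror c1t0d0 c1t1d0 \
      mirror c2t0d0 c2t1d0

zpool will warn about the mismatched replication levels (raidz vs mirror),
so you would need -f to force it if you really want that mix.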

(You could also make separate pools; my experience, for what it's worth,
argues for making pools based on redundancy and performance (and only
worry about BIG differences), and assign file-systems to pools based on
needs for redundancy and performance.  And for my home system I just have
one big data pool, currently consisting of 1x1TB, 2x400GB, 2x400GB, plus
1TB hot spare.)

Or you could stick strictly to mirrors; 4 pools 2x2T, 2x2T, 2x750G,
2x1.5T.  Mirrors are more flexible, give you more redundancy, and are much
easier to work with.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread David Dyer-Bennet

On Mon, February 7, 2011 14:49, Yi Zhang wrote:
 On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld sommerf...@alum.mit.edu
 wrote:
 On 02/07/11 11:49, Yi Zhang wrote:

 The reason why I
 tried that is to get the side effect of no buffering, which is my
 ultimate goal.

 ultimate = final.  you must have a goal beyond the elimination of
 buffering in the filesystem.

 if the writes are made durable by zfs when you need them to be durable,
 why
 does it matter that it may buffer data while it is doing so?

                                                -
 Bill

 If buffering is on, the running time of my app doesn't reflect the
 actual I/O cost. My goal is to accurately measure the time of I/O.
 With buffering on, ZFS would batch up a bunch of writes and change
 both the original I/O activity and the time.

I'm not sure I understand what you're trying to measure (which seems to be
your top priority).  Achievable performance with ZFS would be better using
suitable caching; normally that's the benchmark statistic people would
care about.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication requirements

2011-02-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Michael
 
 Core i7 2600 CPU
 16gb DDR3 Memory
 64GB SSD for ZIL (optional)
 
 Would this produce decent results for deduplication of 16TB worth of pools
 or would I need more RAM still?

What matters is the amount of unique data in your pool.  I'll just assume
it's all unique, but of course that's ridiculous because if it's all unique
then why would you want to enable dedup.  But anyway, I'm assuming 16T of
unique data.  

The rule is a little less than 3G of ram for every 1T of unique data.  In
your case, 16*2.8 = 44.8G ram required in addition to your base ram
configuration.  You need at least 48G of ram.  Or less unique data.
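
As a back-of-the-envelope check (a sketch using the ~2.8G-per-TB figure
above; adjust unique_tb for your own pool):

  unique_tb=16
  echo "scale=1; $unique_tb * 2.8" | bc    # prints 44.8 (GB of RAM on top of your base config)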

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-07 Thread Marion Hakanson
matt.connolly...@gmail.com said:
 After putting the drive online (and letting the resilver complete) I took the
 slow drive (c8t1d0 western digital green) offline and the system ran very
 nicely.
 
 It is a 4k sector drive, but I thought zfs recognised those drives and didn't
 need any special configuration...? 

That's a nice confirmation of the cost of not doing anything special (:-).

I hear the problem may be due to 4k drives which report themselves as 512b
drives, for boot/BIOS compatibility reasons.  I've also seen various ways
to force 4k alignment, and check what the ashift value is in your pool's
drives, etc.  Googling "solaris zfs 4k sector align" will lead the way.
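
One quick way to check (a sketch; tank is a placeholder pool name):

  zdb -C tank | grep ashift    # ashift=9 means 512-byte allocation alignment, ashift=12 means 4K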

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Yi Zhang
On Mon, Feb 7, 2011 at 3:47 PM, Nico Williams n...@cryptonector.com wrote:
 On Mon, Feb 7, 2011 at 2:39 PM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 2:54 PM, Nico Williams n...@cryptonector.com wrote:
 ZFS cannot not buffer.  The reason is that ZFS likes to batch transactions 
 into
 as large a contiguous write to disk as possible.  The ZIL exists to
 support fsync(2)
 operations that must commit before the rest of a ZFS transaction.  In
 other words:
 there's always some amount of buffering of writes in ZFS.
 In that case, ZFS doesn't suit my needs.

 Maybe.  See below.

 As to read buffering, why would you want to disable those?
 My application manages its own buffer and reads/writes go through that
 buffer first. I don't want double buffering.

 So your concern is that you don't want to pay twice the memory cost
 for buffering?

 If so, set primarycache as described earlier and drop the O_DSYNC flag.

 ZFS will then buffer your writes, but only for a little while, and you
 should want it to
 because ZFS will almost certainly do a better job of batching transactions 
 than
 your application would.  With ZFS you'll benefit from: advanced volume
 management,
 snapshots/clones, dedup, Merkle hash trees (i.e., corruption
 detection), encryption,
 and so on.  You'll almost certainly not be implementing any of those
 in your application...

 You still haven't told us what your application does.  Or why you want
 to get close
 to the metal.  Simply telling us that you need no buffering doesn't
 really help us
 help you -- with that approach you'll simply end up believing that ZFS is 
 not
 appropriate for your needs, even though it well might be.
 It's like Berkeley DB at a high level, though it doesn't require
 transaction support, durability, etc. I'm measuring its performance
 and don't want FS buffering to pollute my results (hence directio).

 You're still mixing directio and O_DSYNC.

 You should do three things: a) set primarycache=metadata, b) set recordsize to
 whatever your application's page size is (e.g., 8KB), c) stop using O_DSYNC.

 Tell us how that goes.  I suspect the performance will be much better.

 Nico
 --


This is actually what I did for 2.a) in my original post. My concern
there is that ZFS' internal write buffering makes it hard to get a
grip on my application's behavior. I want to present my application's
raw I/O performance without too many outside factors... UFS plus
directio gives me exactly (or close to) that but ZFS doesn't...

Of course, in the final deployment, it would be great to be able to
take advantage of ZFS' advanced features such as I/O optimization.
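
For reference, the settings Nico suggests above map onto zfs(1M) commands
along these lines (a sketch; tank/bench is a placeholder dataset name):

  zfs set primarycache=metadata tank/bench   # keep only metadata in the ARC
  zfs set recordsize=8K tank/bench           # match the application's 8K page size
  # and open the data file without O_DSYNC in the application itself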
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Bill Sommerfeld

On 02/07/11 12:49, Yi Zhang wrote:

If buffering is on, the running time of my app doesn't reflect the
actual I/O cost. My goal is to accurately measure the time of I/O.
With buffering on, ZFS would batch up a bunch of writes and change
both the original I/O activity and the time.


if batching main pool writes improves the overall throughput of the 
system over a more naive i/o scheduling model, don't you want your users 
to see the improvement in performance from that batching?


why not set up a steady-state sustained workload that will run for 
hours, and measure how long it takes the system to commit each 1000 or 
10000 transactions in the middle of the steady-state workload?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-07 Thread Peter Jeremy
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote:
I'm actually more leaning towards running a simple 7+1 RAIDZ1.
Running this with 1TB is not a problem but I just wanted to
investigate at what TB size the scales would tip.

It's not that simple.  Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool.  And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.

   I understand
RAIDZ2 protects against failures during a rebuild process.

This would be its current primary purpose.

  Currently,
my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
and, worst case, assuming this is 2 days, this is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your
pool gets to twice its current capacity, it might well take 3 days
to rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that
all the calculations so far have assumed that drive failures were
independent.  In practice, this probably isn't true.  All HDD
manufacturers have their off days - where whole batches or models of
disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
demonstration that it's WD's turn to turn out junk.  Your best
protection against this is to have disks from enough different batches
that a batch failure won't take out your pool.

PSU, fan and SATA controller failures are likely to take out multiple
disks but it's far harder to include enough redundancy to handle this
and your best approach is probably to have good backups.

I will be running a hot (or maybe cold) spare.  So I don't need to
factor in the time it takes for a manufacturer to replace the drive.

In which case, the question is more whether 8-way RAIDZ1 with a
hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2).  In the latter
case, your hot spare is already part of the pool so you don't
lose the time-to-notice plus time-to-resilver before regaining
redundancy.  The downside is that actively using the hot spare
may increase the probability of it failing.
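
In zpool syntax the two layouts would look roughly like this (a sketch;
the pool name and disk names are placeholders):

  # 8-way raidz1 (7+1) with a hot spare:
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 \
      c0t4d0 c0t5d0 c0t6d0 c0t7d0 spare c0t8d0

  # 9-way raidz2 (7+2), same nine disks:
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 \
      c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0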

-- 
Peter Jeremy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating zpool to new drives with 4K Sectors

2011-02-07 Thread Matt Connolly
Except for metadata, which seems to be written in small pieces, wouldn't a ZFS
record size that is a multiple of 4k on a 4k-aligned vdev work OK?

Or can a ZFS record that's 16kB, for example, start at any sector in
the vdev?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deduplication requirements

2011-02-07 Thread Erik Trimble

On 2/7/2011 1:06 PM, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Michael

Core i7 2600 CPU
16gb DDR3 Memory
64GB SSD for ZIL (optional)

Would this produce decent results for deduplication of 16TB worth of pools
or would I need more RAM still?

What matters is the amount of unique data in your pool.  I'll just assume
it's all unique, but of course that's ridiculous because if it's all unique
then why would you want to enable dedup.  But anyway, I'm assuming 16T of
unique data.

The rule is a little less than 3G of ram for every 1T of unique data.  In
your case, 16*2.8 = 44.8G ram required in addition to your base ram
configuration.  You need at least 48G of ram.  Or less unique data.


To follow up on Ned's estimation, please let us know what kind of data 
you're planning on putting in the Dedup'd zpool. That can really give us 
a better idea as to the number of slabs that the pool will have, which 
is what drives dedup RAM and L2ARC usage.


You also want to use an SSD for L2ARC, NOT for ZIL (though, you *might* 
also want one for ZIL, depending on your write patterns).



In all honesty, these days, it doesn't pay to dedup a pool unless you 
can count on large amounts of common data.  Virtual Machine images, 
incremental backups, ISO images of data CD/DVDs, and some Video are your 
best bet. Pretty much everything else is going to cost you more in 
RAM/L2ARC than it's worth.



IMHO, you don't want Dedup unless you can *count* on a 10x savings factor.
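
One way to estimate what you would actually get, before turning it on
(a sketch; tank is a placeholder pool name, and the simulation itself
takes a fair amount of RAM and time on a large pool):

  zdb -S tank    # simulates dedup on the existing data; prints a DDT histogram and an estimated dedup ratio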


Also, for reasons discussed here before, I would not recommend a Core i7 
for use as a fileserver CPU. It's an Intel Desktop CPU, and almost 
certainly won't support ECC RAM on your motherboard, and it's seriously 
overpowered for your use.


See if you can find a nice socket AM3+ motherboard for a low-range 
Athlon X3/X4.  You can get ECC RAM for it (even in a desktop 
motherboard), it will cost less, and perform at least as well.


Dedup is not CPU intensive. Compression is, and you may very well want 
to enable that, but you're still very unlikely to hit a CPU bottleneck 
before RAM starvation or disk wait occurs.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Erik Trimble

On 2/7/2011 1:10 PM, Yi Zhang wrote:
[snip]

This is actually what I did for 2.a) in my original post. My concern
there is that ZFS' internal write buffering makes it hard to get a
grip on my application's behavior. I want to present my application's
raw I/O performance without too many outside factors... UFS plus
directio gives me exactly (or close to) that but ZFS doesn't...

Of course, in the final deployment, it would be great to be able to
take advantage of ZFS' advanced features such as I/O optimization.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


And, there's your answer. You seem to care about doing bare-metal I/O 
for tuning of your application, so you can do consistent measurements. 
Not for actual usage in production.


Therefore, do what's inferred in the above:  develop your app, using it 
on UFS w/directio to work out the application issues and tune.  When you 
deploy it, use ZFS and its caching techniques to get maximum (though not 
absolutely consistently measurable) performance for the already-tuned app.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-07 Thread Richard Elling
On Feb 7, 2011, at 1:07 PM, Peter Jeremy wrote:

 On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote:
 I'm actually more leaning towards running a simple 7+1 RAIDZ1.
 Running this with 1TB is not a problem but I just wanted to
 investigate at what TB size the scales would tip.
 
 It's not that simple.  Whilst resilver time is proportional to device
 size, it's far more impacted by the degree of fragmentation of the
 pool.  And there's no 'tipping point' - it's a gradual slope so it's
 really up to you to decide where you want to sit on the probability
 curve.

The tipping point won't occur for similar configurations. The tip
occurs for different configurations. In particular, if the size of the 
N+M parity scheme is very large and the resilver times become
very, very large (weeks) then a (M-1)-way mirror scheme can provide
better performance and dependability. But I consider these to be
extreme cases.

  I understand
 RAIDZ2 protects against failures during a rebuild process.
 
 This would be its current primary purpose.
 
 Currently,
 my RAIDZ1 takes 24 hours to rebuild a failed disk, so with 2TB disks
 and, worst case, assuming this is 2 days, this is my 'exposure' time.
 
 Unless this is a write-once pool, you can probably also assume that
 your pool will get more fragmented over time, so by the time your
 pool gets to twice its current capacity, it might well take 3 days
 to rebuild due to the additional fragmentation.
 
 One point I haven't seen mentioned elsewhere in this thread is that
 all the calculations so far have assumed that drive failures were
 independent.  In practice, this probably isn't true.  All HDD
 manufacturers have their off days - where whole batches or models of
 disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
 demonstration that it's WD's turn to turn out junk.  Your best
 protection against this is to have disks from enough different batches
 that a batch failure won't take out your pool.

The problem with considering the failures as interdependent is that 
you cannot get the failure rate information from the vendors.  You could
guess, or use your own, but it would not always help you make a better design
decision.

 
 PSU, fan and SATA controller failures are likely to take out multiple
 disks but it's far harder to include enough redundancy to handle this
 and your best approach is probably to have good backups.

The top 4 items that fail most often, in no particular order, are: fans,
power supplies, memory, and disk. This is why you will see the enterprise
class servers use redundant fans, multiple high-quality power supplies,
ECC memory, and some sort of RAID.

 
 I will be running a hot (or maybe cold) spare.  So I don't need to
 factor in the time it takes for a manufacturer to replace the drive.
 
 In which case, the question is more whether 8-way RAIDZ1 with a
 hot spare (7+1+1) is better than 9-way RAIDZ2 (7+2).  

In this case, raidz2 is much better for dependability because the spare
is already resilvered.  It also performs better, though the dependability
gains tend to be bigger than the performance gains.

 In the latter
 case, your hot spare is already part of the pool so you don't
 lose the time-to-notice plus time-to-resilver before regaining
 redundancy.  The downside is that actively using the hot spare
 may increase the probability of it failing.

No. The disk failure rate data does not conclusively show that activity
causes premature failure. Other failure modes dominate.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Richard Elling
On Feb 7, 2011, at 1:10 PM, Yi Zhang wrote:
 
 This is actually what I did for 2.a) in my original post. My concern
 there is that ZFS' internal write buffering makes it hard to get a
 grip on my application's behavior. I want to present my application's
 raw I/O performance without too many outside factors... UFS plus
 directio gives me exactly (or close to) that but ZFS doesn't...

In the bad old days when processors only had one memory controller,
one could make an argument that not copying was an important optimization.
Today, with the fast memory controllers (plural) we have, memory copies 
don't hurt very much.  Other factors will dominate. Of course, with dtrace
it should be relatively easy to measure the copy.

 Of course, in the final deployment, it would be great to be able to
 take advantage of ZFS' advanced features such as I/O optimization.

Nice save :-)  otherwise we wonder why you don't just use raw disk if you
are so concerned about memory copies :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-07 Thread Richard Elling
Observation below...

On Feb 4, 2011, at 7:10 PM, Matt Connolly wrote:

 Hi, I have a low-power server with three drives in it, like so:
 
 
 matt@vault:~$ zpool status
  pool: rpool
 state: ONLINE
 scan: resilvered 588M in 0h3m with 0 errors on Fri Jan  7 07:38:06 2011
 config:
 
NAME  STATE READ WRITE CKSUM
rpool ONLINE   0 0 0
  mirror-0ONLINE   0 0 0
c8t1d0s0  ONLINE   0 0 0
c8t0d0s0  ONLINE   0 0 0
cache
  c12d0s0 ONLINE   0 0 0
 
 errors: No known data errors
 
 
 I'm running netatalk file sharing for mac, and using it as a time machine 
 backup server for my mac laptop.
 
 When files are copying to the server, I often see periods of a minute or so 
 where network traffic stops. I'm convinced that there's some bottleneck in 
 the storage side of things because when this happens, I can still ping the 
 machine and if I have an ssh window, open, I can still see output from a 
 `top` command running smoothly. However, if I try and do anything that 
 touches disk (eg `ls`) that command stalls. At the time it comes good, 
 everything comes good, file copies across the network continue, etc.
 
 If I have a ssh terminal session open and run `iostat -nv 5` I see something 
 like this:
 
 
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    1.2   36.0  153.6  4608.0   1.2   0.3    31.9     9.3  16   18  c12d0
    0.0  113.4    0.0  7446.7   0.8   0.1     7.0     0.5  15    5  c8t0d0
    0.2  106.4    4.1  7427.8   4.0   0.1    37.8     1.4  93   14  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.4   73.2   25.7  9243.0   2.3   0.7    31.6     9.8  34   37  c12d0
    0.0  226.6    0.0 24860.5   1.6   0.2     7.0     0.9  25   19  c8t0d0
    0.2  127.6    3.4 12377.6   3.8   0.3    29.7     2.2  91   27  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0   44.2    0.0  5657.6   1.4   0.4    31.7     9.0  19   20  c12d0
    0.2   76.0    4.8  9420.8   1.1   0.1    14.2     1.7  12   13  c8t0d0
    0.0   16.6    0.0  2058.4   9.0   1.0   542.1    60.2 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.2    0.0    25.6   0.0   0.0     0.3     2.3   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   11.0    0.0  1365.6   9.0   1.0   818.1    90.9 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.2    0.0    0.1     0.0   0.0   0.0     0.1    25.4   0    1  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   17.6    0.0  2182.4   9.0   1.0   511.3    56.8 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   16.6    0.0  2058.4   9.0   1.0   542.1    60.2 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   15.8    0.0  1959.2   9.0   1.0   569.6    63.3 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.2    0.0    0.1     0.0   0.0   0.0     0.1     0.1   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   17.4    0.0  2157.6   9.0   1.0   517.2    57.4 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   18.2    0.0  2256.8   9.0   1.0   494.5    54.9 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c12d0
    0.0    0.0    0.0     0.0   0.0   0.0     0.0     0.0   0    0  c8t0d0
    0.0   14.8    0.0  1835.2   9.0   1.0   608.1    67.5 100  100  c8t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.2    0.0    0.1     0.0   0.0   0.0     0.1     0.1   0    0  c12d0
    0.0    1.4    0.0     0.6   0.0   0.0     0.0     0.2   0    0  c8t0d0
    0.0   49.0    0.0  6049.6   6.7   0.5   137.6    11.2 

Re: [zfs-discuss] deduplication requirements

2011-02-07 Thread taemun
On 6 February 2011 01:34, Michael michael.armstr...@gmail.com wrote:

 Hi guys,

 I'm currently running 2 zpools, each in a raidz1 configuration, totalling
 around 16TB of usable data. I'm running it all on an OpenSolaris-based box with
 2gb memory and an old Athlon 64 3700 CPU, I understand this is very poor and
 underpowered for deduplication, so I'm looking at building a new system, but
 wanted some advice first, here is what i've planned so far:

 Core i7 2600 CPU
 16gb DDR3 Memory
 64GB SSD for ZIL (optional)


http://ark.intel.com/Product.aspx?id=52213

The desktop Core i* range
doesn't support ECC RAM at all; this could potentially be a pool breaker if
you get a flipped bit in the wrong place (a significant metadata block).
Just something to keep in mind. Also, Intel have issued a recall (ish) for
all of the 6 series chipsets released so far, the PLL unit for the 3gbit
SATA ports on the chipset is driven too hard and will likely degrade over
time (5~15% failure rate over three years). They are talking about a
March~April time to fix in the channel. If you don't plan on using the 3gbit
SATA ports, then you're fine.

Intel will make socket 1155 Xeons at some point, i.e.
http://en.wikipedia.org/wiki/List_of_future_Intel_microprocessors#.22Sandy_Bridge.22_.2832_nm.29_8
They support ECC (just check for a specific QVL after launch, DDR3 ECC
isn't necessarily the only thing you need to look for). I think the Feb 20
release date may have been pushed for the chipset respin.

Cheers,
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and TRIM - No need for TRIM

2011-02-07 Thread Bob Friesenhahn

On Sun, 6 Feb 2011, Orvar Korvar wrote:


1) Using SSD without TRIM is acceptable. The only drawback is that without 
TRIM, the SSD will write much more, which affects its lifetime. Because when the 
SSD has written enough, it will break.


Why do you think that the SSD should necessarily write much more?  I 
don't follow that conclusion.


If I can figure out how to design a SSD which does not necessarily 
write much more, I suspect that an actual SSD designer can do the 
same.


USB sticks and Compact Flash cards need not apply. :-)

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and TRIM - No need for TRIM

2011-02-07 Thread Eric D. Mudama

On Mon, Feb  7 at 20:43, Bob Friesenhahn wrote:

On Sun, 6 Feb 2011, Orvar Korvar wrote:


1) Using SSD without TRIM is acceptable. The only drawback is that without 
TRIM, the SSD will write much more, which affects its lifetime. Because when the 
SSD has written enough, it will break.


Why do you think that the SSD should necessarily write much more?  I 
don't follow that conclusion.


If I can figure out how to design a SSD which does not necessarily 
write much more, I suspect that an actual SSD designer can do the 
same.


Blocks/sectors marked as being TRIM'd do not need to be maintained by
the garbage collection engine.  Depending on the design of the SSD,
this can significantly reduce the write amplification of the SSD.


--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Sil3124 Sata controller for ZFS on Sparc OpenSolaris Nevada b130

2011-02-07 Thread Jerry Kemp
As part of a small home project, I have purchased a SIL3124 hba in hopes 
of attaching an external drive/drive enclosure via eSATA.


The host in question is an old Sun Netra T1 currently running 
OpenSolaris Nevada b130.


The card in question is this Sil3124 card:

http://www.newegg.com/product/product.aspx?item=N82E16816124003

although I did not purchase it from Newegg.  I specifically purchased 
this card because I have seen reports of it working under 
Solaris/OpenSolaris distros on several Solaris mailing lists.


After installing the card, and associated components, I did numerous 
things in an attempt to see the single drive attached to my Netra, to 
include reconfigure boot, several different devfsadm commands, and 
looking for components using scanpci and cfgadm.  Although it is not 
functional yet (I can't see my drive with format or format -e), I 
believe I see my hba with both prtdiag and prtconf.  I will post some 
additional system information at the bottom of this note.


To cut to the chase, after jumping on Yahoo for some RTFM stuff, it 
looks like there is a system package called *SUNWsi3124*.  I looked on 
my Netra, and it isn't there.  I reviewed my OpenSolaris Nevada b130 
iso, and it also doesn't have a SUNWsi3124 package.


On a whim, I looked on my Sun Ultra20m2 system (X64 AMD system) which is 
also running OpenSolaris Nevada b130, and this package is there.  So it 
looks like the SUNWsi3124 package is x86/x64 only?


I don't know if this specifically is why my Netra doesn't see my eSATA 
drive, but the SUNWsi3124 package is the best lead I have so far.
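
A few things that may be worth checking from the shell (a sketch; none of
this assumes the si3124 driver actually ships for SPARC):

  pkginfo SUNWsi3124                            # is the driver package installed at all?
  grep -i 'pci1095,3124' /etc/driver_aliases    # is anything bound to the card's PCI ID?
  modinfo | grep -i si3124                      # is the driver module currently loaded?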


Thanks for any comments,

Jerry

.
/
=

# prtdiag
System Configuration:  Sun Microsystems  sun4u Netra T1 200 
(UltraSPARC-IIe 500MHz)

System clock frequency: 100 MHz
Memory size: 2048 Megabytes

= CPUs =

               Run   Ecache   CPU     CPU
Brd  CPU   Module   MHz     MB     Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0    0      0      500    0.2     13      1.4


= IO Cards =

            Bus#  Freq
Brd  Type   MHz   Slot  Name                  Model
---  -----  ----  ----  --------------------  ----------------
 0   PCI-1   33    12   ebus
 0   PCI-1   33     3   pmu-pci10b9,7101
 0   PCI-1   33     3   lomp
 0   PCI-1   33     7   isa
 0   PCI-1   33    12   network-pci108e,1101  SUNW,pci-eri
 0   PCI-1   33    12   usb-pci108e,1103.1
 0   PCI-1   33    13   ide-pci10b9,5229
 0   PCI-1   33     5   network-pci108e,1101  SUNW,pci-eri
 0   PCI-1   33     5   usb-pci108e,1103.1
 0   PCI-2   33     8   scsi-glm              Symbios,53C896
 0   PCI-2   33     8   scsi-glm              Symbios,53C896
 0   PCI-2   33     5   raid-pci1095,7124     ---




No failures found in System
===
#

.
/
=

relevant section from prtconf -v

   raid (driver not attached)
Hardware properties:
name='compatible' type=string items=3
value='pci1095,7124' + 'pci1095,3124' + 
'pciclass,010400'


.
/
=
output from prtdev.ksh script, from here
http://bolthole.com/solaris/HCL/


VendorID=0x1095, DeviceID=0x3124
Sub VendorID=0x1095, Sub DeviceID=0x7124
name:  'raid'
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss