Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread John Martin

On 07/29/12 14:52, Bob Friesenhahn wrote:


My opinion is that complete hard drive failure and block-level media
failure are two totally different things.


That would depend on the recovery behavior of the drive for
block-level media failure.  A drive whose firmware retries a bad
sector excessively (there are reports of retries lasting up to
2 minutes) may be indistinguishable from a failed drive.  See previous discussions
of the firmware differences between desktop and enterprise drives.
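
For drives that support SCT Error Recovery Control, smartmontools can
query and cap those retry times; the device path below is only an
example:

  # Show the current SCT ERC read/write recovery limits
  # (values are reported in tenths of a second).
  $ smartctl -l scterc /dev/rdsk/c0t1d0

  # Cap recovery at 7 seconds so a bad sector returns an error
  # quickly instead of stalling the pool (enterprise-like behavior).
  $ smartctl -l scterc,70,70 /dev/rdsk/c0t1d0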


Re: [zfs-discuss] Very poor small-block random write performance

2012-07-19 Thread John Martin

On 07/19/12 19:27, Jim Klimov wrote:


However, if the test file was written in 128K blocks and then
is rewritten with 64K blocks, then Bob's answer is probably
valid - the block would have to be re-read once for the first
rewrite of its half; it might be taken from cache for the
second half's rewrite (if that comes soon enough), and may be
spooled to disk as a couple of 64K blocks or one 128K block
(if both changes come soon after each other - within one TXG).


What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
on this system (FreeBSD, IIRC)?
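
For anyone following along, the current values can be read directly;
the FreeBSD sysctl node name is from memory, so verify with
sysctl -a | grep txg:

  # FreeBSD: the txg tunables are exposed as sysctls
  $ sysctl vfs.zfs.txg

  # Solaris/illumos: read the kernel variables with mdb
  $ echo zfs_txg_synctime_ms/D | pfexec mdb -k
  $ echo zfs_txg_timeout/D | pfexec mdb -k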



Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-10 Thread John Martin

On 07/10/12 19:56, Sašo Kiselkov wrote:

Hi guys,

I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
implementation to supplant the currently utilized sha256. On modern
64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
slower than many of the SHA-3 candidates, so I went out and did some
testing (details attached) on a possible new hash algorithm that might
improve on this situation.


Is the intent to store the 512 bit hash or truncate to 256 bit?
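
The existing checksum field in the block pointer is 256 bits wide
(IIRC), which is why the question matters.  Plain truncation keeps the
32-byte footprint; note that a standardized SHA-512/256 uses different
initial values, so this is only about storage size, not a drop-in
(the filename is hypothetical):

  # a full SHA-512 digest is 64 bytes...
  $ openssl dgst -sha512 -binary somefile | wc -c
  # ...while a 256 bit hash (the current sha256) is 32 bytes
  $ openssl dgst -sha256 -binary somefile | wc -c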



Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-04 Thread John Martin

On 07/04/12 16:47, Nico Williams wrote:


I don't see that the munmap definition assures that anything is written to
disk.  The system is free to buffer the data in RAM as long as it likes
without writing anything at all.


Oddly enough the manpages at the Open Group don't make this clear.  So
I think it may well be advisable to use msync(3C) before munmap() on
MAP_SHARED mappings.  However, I think all implementors should, and
probably all do (Linux even documents that it does) have an implied
msync(2) when doing a munmap(2).  It really makes no sense at all to
have munmap(2) not imply msync(3C).


This assumes msync() has the behavior you expect.  See:

  http://pubs.opengroup.org/onlinepubs/009695399/functions/msync.html

In particular, the paragraph starting with "For mappings to files, ...".


Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks

2012-06-16 Thread John Martin

On 06/16/12 12:23, Richard Elling wrote:

On Jun 15, 2012, at 7:37 AM, Hung-Sheng Tsao Ph.D. wrote:


by the way
when you format, start with cylinder 1, do not use 0


There is no requirement for skipping cylinder 0 for root on Solaris, and there
never has been.


Maybe not for core Solaris, but it is still wise advice
if you plan to use Oracle ASM.  See section 3.3.1.4, 2c:


http://docs.oracle.com/cd/E11882_01/install.112/e24616/storage.htm#CACHGBAH
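
Checking where slice 0 actually starts is quick (the device name is
only an example):

  # The First Sector column shows where each slice begins; for ASM,
  # use a slice that does not start at sector 0 / cylinder 0.
  $ prtvtoc /dev/rdsk/c0t0d0s2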



Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks

2012-06-15 Thread John Martin

On 06/15/12 15:52, Cindy Swearingen wrote:


It's important to identify your OS release to determine whether
booting from a 4k disk is supported.


In addition, whether the drive is really 4096p or 512e/4096p.
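
If smartmontools is handy, the drive's own report makes that
distinction obvious (the device path is only an example):

  # A 512e/4096p drive reports 512-byte logical and 4096-byte
  # physical sectors; a native 4Kn drive reports 4096 for both.
  $ smartctl -i /dev/rdsk/c0t1d0 | grep -i 'sector size'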


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread John Martin

On 05/28/12 08:48, Nathan Kroenert wrote:


Looking to get some larger drives for one of my boxes. It runs
exclusively ZFS and has been using Seagate 2TB units up until now (which
are 512 byte sector).

Anyone offer up suggestions of either 3 or preferably 4TB drives that
actually work well with ZFS out of the box? (And not perform like
rubbish)...

I'm using Oracle Solaris 11, and would prefer not to have to use a
hacked up zpool to create something with ashift=12.


Are you replacing a failed drive or creating a new pool?

I had a drive in a mirrored pool recently fail.  Both
drives were 1TB Seagate ST310005N1A1AS-RK with 512 byte sectors.
All the 1TB Seagate boxed drives I could find with the same
part number on the box (with factory seals in place)
were really ST1000DM003-9YN1 with 512e/4096p.  Just being
cautious, I ended up migrating the pools over to a pair
of the new drives.  The pools were created with ashift=12
automatically:

  $ zdb -C | grep ashift
  ashift: 12
  ashift: 12
  ashift: 12

Resilvering the three pools concurrently went fairly quickly:

  $ zpool status
    scan: resilvered 223G in 2h14m with 0 errors on Tue May 22 21:02:32 2012
    scan: resilvered 145G in 4h13m with 0 errors on Tue May 22 23:02:38 2012
    scan: resilvered 153G in 3h44m with 0 errors on Tue May 22 22:30:51 2012


What performance problem do you expect?


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread John Martin

On 05/29/12 08:35, Nathan Kroenert wrote:

Hi John,

Actually, last time I tried the whole AF (4k) thing, its performance
was worse than woeful.

But admittedly, that was a little while ago.

The drives were the Seagate Green Barracuda IIRC, and performance for
just about everything was 20MB/s per spindle or worse, when it should
have been closer to 100MB/s when streaming. Things were worse still when
doing random...

I'm actually looking to put in something larger than the 3*2TB drives
(triple mirror for read perf) this pool has in it - preferably 3 * 4TB
drives. (I don't want to put in more spindles - just replace the current
ones...)

I might just have to bite the bullet and try something with current SW. :).



Raw read from one of the mirrors:

#  timex dd if=/dev/rdsk/c0t2d0s2 of=/dev/null bs=1024000 count=1
1+0 records in
1+0 records out

real  49.26
user   0.01
sys0.27


filebench filemicro_seqread reports an impossibly high number (4GB/s),
so the ARC is likely handling all reads.
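
A raw-device read that is large enough to amortize the startup cost
gives a more useful number, and reads from /dev/rdsk never touch the
ARC.  A sketch (same device as above), reading 1GB in 1MB records:

  #  timex dd if=/dev/rdsk/c0t2d0s2 of=/dev/null bs=1024k count=1024

Dividing 1024MB by the reported real time gives a streaming MB/s
figure; at the ~100MB/s a healthy drive should deliver, the read
takes on the order of 10 seconds.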

The label on the boxes I bought say:

  1TB 32MB INTERNAL KIT 7200
  ST310005N1A1AS-RK
  S/N: ...
  PN:9BX1A8-573

The drives in the box were really
ST1000DM003-9YN162 with 64MB of cache.
I have multiple pools on slices of each disk, so the
drive's write cache should be disabled.  The drive reports
512 byte logical sectors and 4096 byte physical sectors.


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread John Martin

On 05/29/12 07:26, bofh wrote:


ashift:9  is that standard?


Depends on what the drive reports as physical sector size.
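
ashift is the base-2 log of the sector size the pool was created
with, so 9 means 512-byte sectors and 12 means 4096-byte sectors.
A quick check:

  # 2^9 = 512, 2^12 = 4096
  $ zdb -C | grep ashift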




Re: [zfs-discuss] What is your data error rate?

2012-01-25 Thread John Martin

On 01/25/12 09:08, Edward Ned Harvey wrote:


Assuming the failure rate of drives is not linear, but skewed toward higher 
failure rate after some period of time (say, 3 yrs)  ...


See section 3.1 of the Google study:

  http://research.google.com/archive/disk_failures.pdf

although section 4.2 of the Carnegie Mellon study
is much more supportive of the assumption.

  http://www.usenix.org/events/fast07/tech/schroeder/schroeder.pdf


Re: [zfs-discuss] What is your data error rate?

2012-01-24 Thread John Martin

On 01/24/12 17:06, Gregg Wonderly wrote:

What I've noticed, is that when I have my drives in a situation of small
airflow, and hence hotter operating temperatures, my disks will drop
quite quickly.


While I *believe* the same thing and thus have over-provisioned
airflow in my cases (for both drives and memory), there
are studies which failed to find a strong correlation between
drive temperature and failure rates:

  http://research.google.com/archive/disk_failures.pdf

  http://www.usenix.org/events/fast07/tech/schroeder.html



Re: [zfs-discuss] Data loss by memory corruption?

2012-01-16 Thread John Martin

On 01/16/12 11:08, David Magda wrote:


The conclusions are hardly unreasonable:


While the reliability mechanisms in ZFS are able to provide reasonable
robustness against disk corruptions, memory corruptions still remain a
serious problem to data integrity.


I've heard the same thing said (use ECC!) on this list many times over
the years.


I believe the whole paragraph quoted from the USENIX paper above is
important:

  While the reliability mechanisms in ZFS are able to
  provide reasonable robustness against disk corruptions,
  memory corruptions still remain a serious problem to
  data integrity. Our results for memory corruptions
  indicate cases where bad data is returned to the user,
  operations silently fail, and the whole system crashes.
  Our probability analysis shows that one single bit flip
  has small but non-negligible chances to cause failures
  such as reading/writing corrupt data and system crashing.

The authors provide probability calculations in section 6.3
for single bit flips.  ECC provides detection and correction
of single bit flips.


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread John Martin

On 01/08/12 20:10, Jim Klimov wrote:


Is it true or false that: ZFS might skip the cache and
go to disks for streaming reads?


I don't believe this was ever suggested.  Instead, if
data is not already in the file system cache and a
large read is made from disk, should the file system
put this data into the cache?

BTW, I chose the term "streaming" for a subset of "sequential"
where the access pattern is sequential but proceeds at what
appear to be artificial time intervals.
The suggested pre-read of the entire file would
be a simple sequential read done as quickly
as the hardware allows.


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-09 Thread John Martin

On 01/08/12 10:15, John Martin wrote:


I believe Joerg Moellenkamp published a discussion
several years ago on how the L1ARC attempts to deal with the pollution
of the cache by large streaming reads, but I don't have
a bookmark handy (nor the knowledge of whether the
behavior is still accurate).


http://www.c0t0d0s0.org/archives/5329-Some-insight-into-the-read-cache-of-ZFS-or-The-ARC.html


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread John Martin

On 01/08/12 09:30, Edward Ned Harvey wrote:


In the case of your MP3 collection...  Probably the only thing you can do is
to write a script which will simply go read all the files you predict will
be read soon.  The key here is the prediction - There's no way ZFS or
solaris, or any other OS in the present day is going to intelligently
predict which files you'll be requesting soon.



The other prediction is whether the blocks will be reused.
If the blocks of a streaming read are only used once, then
it may be wasteful for a file system to allow these blocks
to be placed in the cache.  If a file system purposely
chooses to not cache streaming reads, manually scheduling a
pre-read of particular files may simply cause the file to be read
from disk twice: on the manual pre-read and when it is read again
by the actual application.
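
ZFS does leave that choice to the administrator as well: the
primarycache and secondarycache dataset properties control whether
file data is eligible for the ARC/L2ARC at all (the dataset name
below is hypothetical):

  # Keep only metadata in the ARC for a dataset of large streaming
  # files, so one-shot reads don't evict more useful cached data.
  $ zfs set primarycache=metadata tank/media
  $ zfs set secondarycache=none tank/media
  $ zfs get primarycache,secondarycache tank/media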

I believe Joerg Moellenkamp published a discussion
several years ago on how the L1ARC attempts to deal with the pollution
of the cache by large streaming reads, but I don't have
a bookmark handy (nor the knowledge of whether the
behavior is still accurate).


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread John Martin

On 01/08/12 11:30, Jim Klimov wrote:


However for smaller servers, such as home NASes which have
about one user overall, pre-reading and caching files even
for a single use might be an objective per se - just to let
the hard-disks spin down. Say, if I sit down to watch a
movie from my NAS, it is likely that for 90 or 120 minutes
there will be no other IO initiated by me. The movie file
can be pre-read in a few seconds, and then most of the
storage system can go to sleep.


Isn't this just a more extreme case of prediction?
In addition to the file system knowing there will only
be one client reading 90-120 minutes of (HD?) video
that will fit in the memory of a small(er) server,
now the hard drive power management code also knows there
won't be another access for 90-120 minutes so it is OK
to spin down the hard drive(s).
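
The pre-read itself is trivial, though it only helps if the file fits
in the ARC and the dataset still has primarycache=all (the path is
hypothetical):

  # Pull the whole file into the ARC so playback never touches disk;
  # at ~100MB/s a 4GB movie takes well under a minute to pre-read.
  $ dd if=/tank/media/movie.mkv of=/dev/null bs=1024k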


Re: [zfs-discuss] bad seagate drive?

2011-09-12 Thread John Martin

On 09/12/11 10:33, Jens Elkner wrote:


Hmmm, at least if S11x, ZFS mirror, ICH10 and the cmdk (IDE) driver are
involved, I'm 99.9% confident that "a while" turns out to be some days or
weeks only - no matter what Platinum-Enterprise-HDDs you use ;-)


On Solaris 11 Express with a dual drive mirror, ICH10 and the AHCI
driver (not sure why you would purposely choose to run in IDE mode)
resilvering a 1TB drive (Seagate ST310005N1A1AS-RK) went at a rate of
3.2GB/min.  Deduplication was not enabled.  Only hours for a 55%
full mirror, not days or weeks.
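
As a rough sanity check on that rate (assuming roughly 550GB of
allocated data on a 55% full 1TB mirror):

  # 550GB at 3.2GB/min is a little under three hours
  $ echo 'scale=1; 550 / 3.2 / 60' | bc
  2.8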



Re: [zfs-discuss] BAD WD drives - defective by design?

2011-09-06 Thread John Martin

http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives


[zfs-discuss] matching zpool versions to development builds

2011-08-08 Thread John Martin

Is there a list of zpool versions for development builds?

I found:

  http://blogs.oracle.com/stw/entry/zfs_zpool_and_file_system

where it says Solaris 11 Express is zpool version 31, but my
system has BEs back to build 139, I have not done a zpool upgrade
since installing this system, and the current development build
reports:

  # zpool upgrade -v
  This system is currently running ZFS pool version 33.
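
For a per-pool view, zpool get shows the on-disk version of each pool
(the pool names below are examples), and zpool upgrade with no
arguments lists any pools formatted with an older version than the
running software supports:

  # On-disk version of specific pools (pools are never upgraded
  # automatically by a software update)
  $ zpool get version rpool tank

  # List pools whose on-disk version is older than the software
  $ zpool upgrade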
