Re: [zfs-discuss] VDI iops with caching

2013-01-04 Thread Eric D. Mudama

On Thu, Jan  3 at 20:38, Geoff Nordli wrote:

   On Jan 2, 2013, at 8:45 PM, Geoff Nordli geo...@gnaa.net wrote:


   I am looking at the performance numbers for the Oracle VDI admin guide.

   http://docs.oracle.com/html/E26214_02/performance-storage.html

   From my calculations for 200 desktops running Windows 7 knowledge user
   (15 iops) with a 30-70 read/write split it comes to 5100 iops. Using
   7200 rpm disks the requirement will be 68 disks.
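
For reference, the 5100 iops and 68 disk figures can be reproduced with a short calculation. This is only a sketch: the write penalty of 2 (each write hitting both sides of a mirror) and the 75 IOPS per 7200 rpm disk are assumptions picked because they reproduce the quoted numbers, not values stated in the post.

```python
import math

# ASSUMPTIONS (not from the post): write_penalty=2 models mirrored writes,
# and iops_per_disk=75 is a rule-of-thumb figure for a 7200 rpm disk.
def backend_iops(desktops, iops_per_desktop, read_pct, write_penalty=2):
    front_end = desktops * iops_per_desktop      # 200 * 15 = 3000
    reads = front_end * read_pct // 100          # 30% reads
    writes = front_end - reads                   # 70% writes
    return reads + writes * write_penalty

def disks_needed(total_iops, iops_per_disk=75):
    return math.ceil(total_iops / iops_per_disk)

total = backend_iops(200, 15, 30)
print(total, disks_needed(total))
```

Note that this sizing assumes every read and write reaches the spindles, which is exactly the assumption the next paragraph questions.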

   This doesn't seem right, because if you are using clones with caching,
   you should be able to easily satisfy your reads from ARC and L2ARC.  As
   well, Oracle VDI by default caches writes; therefore the writes will be
   coalesced and there will be no ZIL activity.


Yes, I would like to stick with HDDs.

I am just not quite sure what "quite a few desktops" means.


I thought for sure there would be lots of people around that have done small
deployments using a standard ZFS deployment. 


Even a single modern SSD should be able to provide hundreds of
gigabytes of fast L2ARC to your system, and can scale as your userbase
grows for a relatively small initial investment.  This is actually
about the perfect use case for an L2ARC on SSD.


--
Eric D. Mudama
edmud...@bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] any more efficient way to transfer snapshot between two hosts than ssh tunnel?

2012-12-14 Thread Eric D. Mudama

On Fri, Dec 14 at  9:29, Fred Liu wrote:




We have found mbuffer to be the fastest solution.   Our rates for large
transfers on 10GbE are:

280MB/s   mbuffer
220MB/s   rsh
180MB/s   HPN-ssh unencrypted
 60MB/s   standard ssh

The tradeoff: mbuffer is a little more complicated to script; rsh is,
well, you know; and hpn-ssh requires rebuilding ssh and (probably)
maintaining a second copy of it.

 -- Trey Palmer



In a 10GbE environment, even 280MB/s is not such a decent result. Maybe the
alternative could be a two-step approach: putting snapshots on NFS/iSCSI and
receiving them locally. But that is not perfect.
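
To put the quoted rates in perspective, here is a small sketch that expresses each one as a fraction of raw 10GbE line rate. It ignores protocol overhead, so the 1250 MB/s ceiling is an idealized assumption.

```python
# 10 Gbit/s expressed in MB/s, ignoring Ethernet/TCP overhead (assumption).
WIRE_MB_PER_S = 10_000 / 8

# Rates as quoted in the post above.
rates = {
    "mbuffer": 280,
    "rsh": 220,
    "HPN-ssh unencrypted": 180,
    "standard ssh": 60,
}

def utilization(mb_per_s, wire=WIRE_MB_PER_S):
    """Fraction of the wire speed actually used."""
    return mb_per_s / wire

for name, rate in rates.items():
    print(f"{name:20s} {rate:4d} MB/s  {utilization(rate):6.1%} of wire speed")
```

Even the fastest transport uses well under a quarter of the link, which suggests the bottleneck is somewhere other than the wire.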


Even with infinite wire speed, you're bound by the ability of the
source server to generate the snapshot stream and the ability of the
destination server to write the snapshots to the media.

Our little servers in-house using ZFS don't read/write that fast when
pulling snapshot contents off the disks, since they're essentially
random access on a server that's been creating/deleting snapshots for
a long time.

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Intel DC S3700

2012-11-14 Thread Eric D. Mudama

On Wed, Nov 14 at  0:28, Jim Klimov wrote:
All in all, I can't come up with anything offensive against it
quickly ;) One possible nit regards the ratings being geared towards
4KB blocks (which is not unusual with SSDs), so it may be further from
announced performance with other block sizes - i.e. when caching ZFS
metadata.


Would an ashift of 12 conceivably address that issue?


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Good tower server for around 1,250 USD?

2012-03-22 Thread Eric D. Mudama

On Thu, Mar 22 at 18:30, The Honorable Senator and Mrs. John Blutarsky wrote:

Ladies and Gentlemen,

I'm thinking about spending around 1,250 USD for a tower format (desk side)
server with RAM but without disks. I'd like to have 16G ECC RAM as a
minimum and ideally 2 or 3 times that amount and I'd like for the case to
have room for at least 6 drives, more would be better but not essential. I
want to run Solaris 10 and possibly upgrade to Solaris 11 if I like
it. Right now I have nothing to run Solaris 11 on and I know Solaris 10 well
enough to know it will do what I want.

This will be a do-everything machine. I will use it for development, hosting
various apps in zones (web, file server, mail server etc.) and running other
systems (like a Solaris 11 test system) in VirtualBox. Ultimately I would
like to put it under Solaris support so I am looking for something
officially approved. The problem is there are so many systems on the HCL I
don't know where to begin. One of the Supermicro super workstations looks
good and I know they have a good reputation but Dell has better sales
channels where I live and I could get one of those or even an HP more easily
than a Supermicro as far as I know. I will be checking more on this.

I have a bunch of white box systems but I don't know anybody capable of
building a server grade box so I'm probably going to buy off the shelf.

Can anybody tell me whether what I am looking for is going to be available at
this price point and, if not, what I should expect to pay? If you have experience
with any of the commodity server towers, good or bad, with Solaris and ZFS I'd
like to hear your opinions. I am refraining from asking for advice on drives
because I see the list has a few thousand posts archived on this topic and
until I go over some of those I don't want to ask about that subject just
yet. Thanks for the help.


Most of the supermicro stuff works great for me.


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Any HP Servers recommendation for Openindiana (Capacity Server) ?

2012-01-06 Thread Eric D. Mudama

On Wed, Jan  4 at 13:55, Fajar A. Nugraha wrote:

Were the Dell cards able to present the disks as JBOD without any
third-party-flashing involved?


Yes, the ones I have tested (SAS 6/iR) worked as expected (bare drives
exposed to ZFS) with no changes to drive firmware.  I have not tested
the H200/H700 adapters, since we don't really need 6Gbit/s and are
still ordering our systems with SAS 6/iR.

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Any HP Servers recommendation for Openindiana (Capacity Server) ?

2012-01-03 Thread Eric D. Mudama

On Tue, Jan  3 at  8:03, Gary Driggs wrote:

I can't comment on their 4U servers but HP's 12U included SAS
controllers rarely allow JBOD discovery of drives. So I'd recommend an
LSI card and an external storage chassis like those available from
Promise and others.


What got us with the HP boxes was the unsupported RAID cards.
We ended up getting Dell T610 boxes with SAS6i/R cards, which are
properly supported in Solaris/OI.

Supposedly the H200/H700 cards are just their name for the 6gbit LSI
SAS cards, but I haven't tested them personally.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] slow zfs send/recv speed

2011-11-16 Thread Eric D. Mudama

On Wed, Nov 16 at  9:35, David Dyer-Bennet wrote:


On Tue, November 15, 2011 17:05, Anatoly wrote:

Good day,

The speed of send/recv is around 30-60 MBytes/s for initial send and
17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
to 100+ disks in pool. But the speed doesn't vary in any degree. As I
understand 'zfs send' is a limiting factor. I did tests by sending to
/dev/null. It worked out too slow and absolutely not scalable.
None of cpu/memory/disk activity was at peak load, so there is room
for improvement.


What you're probably seeing with incremental sends is that the disks being
read are hitting their IOPS limits.  Zfs send does random reads all over
the place -- every block that's changed since the last incremental send is
read, in TXG order.  So that's essentially random reads all over the disk.


Anatoly didn't state whether his 160GB file test was done on a virgin
pool, or whether it was allocated out of an existing pool.  If the
latter, your comment is the likely explanation.  If the former, your
comment wouldn't explain the slow performance.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] slow zfs send/recv speed

2011-11-15 Thread Eric D. Mudama

On Wed, Nov 16 at  3:05, Anatoly wrote:

Good day,

The speed of send/recv is around 30-60 MBytes/s for initial send and 
17-25 MBytes/s for incremental. I have seen lots of setups with 1 
disk to 100+ disks in pool. But the speed doesn't vary in any degree. 
As I understand 'zfs send' is a limiting factor. I did tests by 
sending to /dev/null. It worked out too slow and absolutely not 
scalable.
None of cpu/memory/disk activity was at peak load, so there is
room for improvement.


My belief is that initial/incremental may be affecting it because of
initial versus incremental efficiency of the data layout in the pools,
not because of something inherent in the send/recv process itself.

There are various send/recv improvements (e.g. don't use SSH as a
tunnel) but even that shouldn't be capping you at 17MBytes/sec.

My incrementals get me ~35MB/s consistently.  Each incremental is
10-50GB worth of transfer.

cheap gig switch, no jumbo frames
Source = 2 mirrored vdevs + l2arc ssd, CPU = xeon E5520, 6GB RAM
Destination = 4-drive raidz1, CPU = c2d E4500 @2.2GHz, 2GB RAM
tunnel is un-tuned SSH

I found these guys have the same result - around 7 Mbytes/s for 
'send' and 70 Mbytes for 'recv'.

http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html


Their data doesn't match mine.

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Eric D. Mudama

On Wed, Sep 21 at 10:32, Markus Kovero wrote:

I'd say a price range around the same as the X25-E was, with the main
priorities being predictable latency and performance. Also, write wear
shouldn't become an issue when writing 150MB/s 24/7/365.


At 150MB/s continuously, you're writing roughly 5PB/year (assuming a
write amplification of 1.0).  In practice, wAmp is often much higher,
depending on the workload.
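
The yearly figure can be checked with a few lines of arithmetic. In this sketch the write amplification of 3.0 in the second call is purely a hypothetical value, included to show how quickly NAND wear scales; it is not a measured number for any device.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000

def petabytes_per_year(mb_per_s, write_amplification=1.0):
    """Bytes physically written to flash per year, in decimal PB."""
    bytes_per_year = mb_per_s * 1_000_000 * SECONDS_PER_YEAR * write_amplification
    return bytes_per_year / 1e15

host_writes = petabytes_per_year(150)        # what the host sends
nand_writes = petabytes_per_year(150, 3.0)   # hypothetical wAmp of 3.0
print(f"{host_writes:.2f} PB/year host, {nand_writes:.2f} PB/year to NAND")
```

At a write amplification of 1.0 this works out to a little under 5PB per year, consistent with the rounded figure above.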

How long do you plan on having this device last?  How much retention
do you need in your application?  What is your workload?

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Pure SSD Pool

2011-07-16 Thread Eric D. Mudama

On Tue, Jul 12 at 23:44, Jim Klimov wrote:

2011-07-12 23:14, Eric Sproul wrote:
So finding drives that keep more space in reserve is key to getting 
consistent performance under ZFS.

I think I've read in a number of early SSD reviews
(possibly regarding Intel devices - not certain now)
that the vendor provided some low-level formatting
tools which essentially allowed the user to tune how
much flash would be useable and how much would
be set aside as the reserve...

Perhaps this rumour is worth an exploration too -
do any modern SSDs have similar tools to switch
them between capacity and performance modes,
or such?


It doesn't require special tools, just partition the device.

Since ZFS will stay within a partition boundary if told to, that
should allow you to guarantee a certain minimum reserve area available
for other purposes.

e.g. Take a 100GB drive and partition it to 80GB.  Assuming the
original drive was a 100GB/100GiB design, you now have (100*0.07)+20
GB of spare area, which depending on the design, may significantly
lower write amplification and thus increase performance on a device
that is full.
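
The spare-area arithmetic in that example can be sketched as follows. It assumes, as the example does, that the drive carries 100GiB of raw flash behind a 100GB advertised capacity, and that the controller can treat never-written LBAs as spare; both are assumptions about the particular device.

```python
GIB = 2**30
GB = 10**9

def spare_area_gb(raw_gib, partition_gb):
    """Flash left over as spare when only partition_gb is ever written."""
    raw_gb = raw_gib * GIB / GB          # 100 GiB is about 107.37 GB
    return raw_gb - partition_gb

factory_spare = spare_area_gb(100, 100)  # the "100*0.07" term, about 7.4 GB
after_partition = spare_area_gb(100, 80) # about 27.4 GB
print(f"{factory_spare:.2f} GB -> {after_partition:.2f} GB spare")
print(f"spare as a share of user capacity: {after_partition / 80:.0%}")
```

The region outside the partition only counts as spare if it has never been written (or has been trimmed or secure-erased), so this is best done on a fresh device.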

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] question about COW and snapshots

2011-06-15 Thread Eric D. Mudama

On Wed, Jun 15 at  7:29, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Richard Elling

That would suck worse.


Don't mind Richard.  He is of the mind that ZFS is perfect for everything
just the way it is, and anybody who wants anything different should adjust
their thought process.

Richard, just because it's not something you want, doesn't mean you should
rain on somebody else's parade.  If Simon wants something like that, kudos
to him.

I know I've certainly had many situations where people wanted to snapshot or
rev individual files every time they're modified.  As I said - perfect
example is Google Docs.  Yes it is useful.  But no, it's not what ZFS does.


suck worse = every single file would show a snapshot version for every
change anywhere in the filesystem, not just the changes unique to that file.

Imagine scrolling through a few hundred thousand snapshots in the
windows old version dialog because your 5000 files were getting
edited 2-3 times/day for a month.  Imagine trying to parse the results
of 'zfs list -t snapshot'.  Picture the disaster of that system in 5
years of operation.

IMO, this problem begs for one of today's content management systems
plus 10 minutes of training on how to use it effectively.  Save the
snapshotting for the periodic backup of the CMS and/or users'
systems.  Sure, map work areas via NFS or CIFS, and give them a
time-machine like picture of history for that work area (hourly for a
day, daily for a week, weekly for a month, monthly for a year, etc.)
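
A quick sketch of the two snapshot counts being compared above. The per-edit numbers come straight from the example; the scheduled count assumes 24 hourly, 7 daily, 4 weekly and 12 monthly snapshots are retained, which is one reading of the schedule rather than a fixed rule.

```python
def per_edit_snapshots(files, edits_per_file_per_day, days):
    """One snapshot for every edit of every file."""
    return files * edits_per_file_per_day * days

def scheduled_snapshots(hourly=24, daily=7, weekly=4, monthly=12):
    """Snapshots retained under a time-machine style schedule (assumed counts)."""
    return hourly + daily + weekly + monthly

low = per_edit_snapshots(5000, 2, 30)    # 300000
high = per_edit_snapshots(5000, 3, 30)   # 450000
kept = scheduled_snapshots()             # 47
print(low, high, kept)
```

The gap of roughly four orders of magnitude is the practical argument for periodic snapshots plus a CMS.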

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)

2011-06-14 Thread Eric D. Mudama

On Tue, Jun 14 at  8:04, Paul Kraus wrote:

On Mon, Jun 13, 2011 at 6:01 PM, Erik Trimble erik.trim...@oracle.com wrote:


I'd have to re-look at the exact numbers, but, I'd generally say that
2x6raidz2 vdevs would be better than either 1x12raidz3 or 4x3raidz1 (or
3x4raidz1), for a home server not looking for super-critical protection (in
which case, you should be using mirrors with spares, not raidz*).


I saw some stats a year or more ago that indicated the MTTDL for raidZ2
was better than for a 2-way mirror. In order of best to worst I
remember the rankings as:

raidZ3 (least likely to lose data)
3-way mirror
raidZ2
2-way mirror
raidZ1 (most likely to lose data)

This is for Mean Time to Data Loss, or essentially the odds of losing
_data_ due to one (or more) drive failures. I do not know if this took
number of devices per vdev and time to resilver into account.
Non-redundant configurations were not even discussed. This information
came out of Sun (pre-Oracle) and _may_ have been traceable back to
Brendan Gregg.


Googling "mttdl raidz zfs" digs up:

http://blogs.oracle.com/relling/entry/zfs_raid_recommendations_space_performance
http://blogs.oracle.com/relling/entry/raid_recommendations_space_vs_mttdl
http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

I think the second picture is the one you were thinking of.  The 3rd
link adds raidz3 data to the charts.
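
As a rough illustration of why the ranking comes out in that order, here is a sketch using a common simplified MTTDL model (independent failures, constant MTBF and repair time, one vdev considered in isolation). The model, the MTBF and MTTR values, and the vdev widths are all assumptions for illustration; the linked charts may use different models and inputs.

```python
from math import prod

def mttdl_hours(n_disks, tolerated_failures, mtbf_h, mttr_h):
    """Simplified MTTDL of a single vdev that survives `tolerated_failures`.

    Data is lost when k+1 disks are down at the same time:
        MTTDL = MTBF^(k+1) / (N * (N-1) * ... * (N-k) * MTTR^k)
    """
    k = tolerated_failures
    combos = prod(n_disks - i for i in range(k + 1))
    return mtbf_h ** (k + 1) / (combos * mttr_h ** k)

MTBF_H = 1_000_000   # hypothetical per-disk MTBF in hours
MTTR_H = 24          # hypothetical time to replace and resilver

# (disks per vdev, failures tolerated); the widths are illustrative only
layouts = {
    "raidZ3 (8 disks)": (8, 3),
    "3-way mirror": (3, 2),
    "raidZ2 (6 disks)": (6, 2),
    "2-way mirror": (2, 1),
    "raidZ1 (4 disks)": (4, 1),
}

results = {name: mttdl_hours(n, k, MTBF_H, MTTR_H) for name, (n, k) in layouts.items()}
ranking = sorted(results, key=results.get, reverse=True)
for name in ranking:
    print(f"{name:18s} {results[name]:.2e} hours")
```

This ignores how many vdevs each layout needs for the same usable capacity and how resilver time grows with vdev width, both of which can shift the comparison.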



--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] SATA disk perf question

2011-06-03 Thread Eric D. Mudama

On Thu, Jun  2 at 20:49, Erik Trimble wrote:
Nope. In terms of actual, obtainable IOPS, a 7200RPM drive isn't 
going to be able to do more than 200 under ideal conditions, and 
should be able to manage 50 under anything other than the 
pedantically worst-case situation. That's only about a 50% deviation, 
not like an order of magnitude or so.


Most cache-enabled 7200RPM drives can do 20K+ sequential IOPS at small
block sizes, up close to their peak transfer rate.  


For random IO, I typically see 80 IOPS for unqueued reads, 120 for
queued reads/writes with cache disabled, and maybe 150-200 for cache
enabled writes.  The above are all full-stroke, so the average seek is
1/3 stroke (unqueued).  On a smaller data set where the drive dwarfs
the data set, average seek distance is much shorter and the resulting
IOPS can be quite a bit higher.
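
The unqueued random-read figure is close to what a simple service-time estimate predicts. In this sketch the seek times (8.5 ms for an average full-stroke seek, 2 ms for a short-stroked data set) are assumed typical values, not measurements, and transfer time and command overhead are ignored.

```python
def random_iops(rpm, avg_seek_ms):
    """Unqueued random IOPS from average seek plus half a rotation."""
    rotational_latency_ms = 0.5 * 60_000 / rpm   # 7200 rpm -> about 4.17 ms
    return 1000 / (avg_seek_ms + rotational_latency_ms)

full_stroke = random_iops(7200, 8.5)    # assumed average seek across the disk
short_stroke = random_iops(7200, 2.0)   # assumed seek within a small data set
print(f"{full_stroke:.0f} IOPS full stroke, {short_stroke:.0f} IOPS short stroke")
```

Queuing and write caching let the drive reorder requests and shorten the effective seek, which is why those cases come out higher than this estimate.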

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Eric D. Mudama

On Tue, May 31 at  8:52, Paul Kraus wrote:

   When we initially configured a large (20TB) file server about 5
years ago, we went with multiple zpools and multiple datasets (zfs) in
each zpool. Currently we have 17 zpools and about 280 datasets.
Nowhere near the 10,000+ you intend. We are moving _away_ from the
many dataset model to one zpool and one dataset. We are doing this for
the following reasons:

1. manageability
2. space management (we have wasted space in some pools while others
are starved)
3. tool speed

   I do not have good numbers for time to do some of these operations
as we are down to under 200 datasets (1/3 of the way through the
migration to the new layout). I do have log entries that point to
about a minute to complete a `zfs list` operation.


It would be interesting to see if you still had issues (#3) with 1 pool and
your 280 datasets.  It would definitely eliminate #2.

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Mapping sas address to physical disk in enclosure

2011-05-19 Thread Eric D. Mudama

On Thu, May 19 at  9:55, Evaldas Auryla wrote:
Hi, we have SunFire X4140 connected to Dell MD1220 SAS enclosure, 
single path, MPxIO disabled, via LSI SAS9200-8e HBA. Disks are 
visible with sas-addresses such as this in zpool status output:


   NAME                       STATE     READ WRITE CKSUM
   cuve                       ONLINE       0     0     0
     mirror-0                 ONLINE       0     0     0
       c9t5000C50025D5AF66d0  ONLINE       0     0     0
       c9t5000C50025E5A85Ad0  ONLINE       0     0     0
     mirror-1                 ONLINE       0     0     0
       c9t5000C50025D591BEd0  ONLINE       0     0     0
       c9t5000C50025E1BD56d0  ONLINE       0     0     0
     ...

Is there an easy way to map these sas-addresses to the physical disks 
in enclosure ?


You should be able to match the '5000C50025D5AF66' with the WWID
printed on the label of the disk.  It's likely not visible; however,
if you had a maintenance window, you could pull the disks, write the
IDs down, and just keep the paper handy.

That, or use the trusty 'dd' to read from it and find the solid light.
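
If there are many disks, pulling the identifier out of each device name can be scripted. This is a minimal sketch that assumes device names of the form shown in the zpool status output above (controller, 16 hex digits, then d and a LUN number); the helper name is made up for illustration, and how closely the printed label matches this string may vary by vendor.

```python
import re

# cN t<16 hex digits> dN, as in the zpool status listing above
DEVICE_RE = re.compile(r"^c\d+t([0-9A-Fa-f]{16})d\d+$")

def wwn_from_device(name):
    """Return the 16-digit WWN embedded in a cXtWWNdY name, or None."""
    match = DEVICE_RE.match(name)
    return match.group(1).upper() if match else None

devices = [
    "c9t5000C50025D5AF66d0",
    "c9t5000C50025E5A85Ad0",
    "c9t5000C50025D591BEd0",
    "c9t5000C50025E1BD56d0",
]
for dev in devices:
    print(dev, "->", wwn_from_device(dev))
```

The resulting list can be printed and taped to the enclosure once each WWN has been matched to a slot.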

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 14:29, Paul Kraus wrote:

   I have stopped buying drives (and everything else) from Newegg
as they cannot be bothered to properly pack items. It is worth the
extra $5 per drive to buy them from CDW (who uses factory approved
packaging). Note that I made this change 5 or so years ago and Newegg
may have changed their packaging since then.


NewEgg packaging is exactly what you describe, unchanged in the last
few years.  Most recent newegg drive purchase was last week for me.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Eric D. Mudama

On Mon, May 16 at 21:55, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Paul Kraus

All drives have a very high DOA rate according to Newegg. The
way they package drives for shipping is exactly how Seagate
specifically says NOT to pack them here


As of 8 months ago, Newegg says they've changed this practice.
http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167


The drives I just bought were half packed in white foam then wrapped
in bubble wrap.  Not all edges were protected with more than bubble
wrap.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Extremely Slow ZFS Performance

2011-05-04 Thread Eric D. Mudama

On Wed, May  4 at 12:21, Adam Serediuk wrote:

Both iostat and zpool iostat show very little to zero load on the devices,
even while blocking.

Any suggestions on avenues of approach for troubleshooting?


is 'iostat -en' error free?


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-03 Thread Eric D. Mudama

On Tue, May  3 at 17:39, Rich Teer wrote:

Hi all,

I'm playing around with nearline backups using zfs send | zfs recv.
A full backup made this way takes quite a lot of time, so I was
wondering: after the initial copy, would using an incremental send
(zfs send -i) make the process much quicker because only the stuff that
had changed between the previous snapshot and the current one would be
copied?  Is my understanding of incremental zfs send correct?


Your understanding is correct.  We use -I, not -i, since it can send
multiple snapshots with a single command.  Only the amount of changed
data is sent with an incremental 'zfs send'.
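
To make the difference between the two flags concrete, here is a small sketch that only builds the command lines; it does not run them. The dataset and snapshot names are hypothetical, and the helper function is made up for illustration.

```python
def incremental_send_argv(dataset, from_snap, to_snap, include_intermediate=False):
    """Build a 'zfs send' command line for an incremental stream.

    -i sends only the delta between the two named snapshots.
    -I also carries every intermediate snapshot between them.
    """
    flag = "-I" if include_intermediate else "-i"
    return ["zfs", "send", flag, f"{dataset}@{from_snap}", f"{dataset}@{to_snap}"]

# Hypothetical dataset and snapshot names
print(" ".join(incremental_send_argv("tank/home", "monday", "friday")))
print(" ".join(incremental_send_argv("tank/home", "monday", "friday",
                                     include_intermediate=True)))
```

With -I the receiving side ends up with the Tuesday through Thursday snapshots as well, which is why it suits backup jobs that run less often than snapshots are taken.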


Also related to this is a performance question.  My initial test involved
copying a 50 MB zfs file system to a new disk, which took 2.5 minutes
to complete.  That strikes me as being a bit high for a mere 50 MB;
are my expectations realistic or is it just because of my very
budget-conscious setup?  If so, where's the bottleneck?


Our setup does a send/recv at roughly 40MB/s over ssh connected to a
1gbit/s ethernet connection.  There are ways to make this faster by
not using an encrypted transport, but setup is a bit more advanced
than just an ssh 'zfs recv' command line.


The source pool is on a pair of 146 GB 10K RPM disks on separate
busses in a D1000 (split bus arrangement) and the destination pool
is on a IOMega 1 GB USB attached disk.  The machine to which both
pools are connected is a Sun Blade 1000 with a pair of 900 MHz US-III
CPUs and 2 GB of RAM.  The HBA is Sun's dual differential UltraSCSI
PCI card.  The machine was relatively quiescent apart from doing the
local zfs send | zfs recv.


I'm guessing that the USB bus and/or the USB disk is part of your
bottleneck.  UltraSCSI should be plenty fast and your CPU should be
fine too.
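
Putting numbers on it: the sketch below converts the reported 50 MB in 2.5 minutes into a rate and compares it with the roughly 40MB/s mentioned earlier. The USB 1.1 ceiling of 12 Mbit/s is included only as a possible explanation; whether the port in question is USB 1.1 is an assumption, not something stated in the thread.

```python
def throughput_mb_per_s(megabytes, seconds):
    return megabytes / seconds

observed = throughput_mb_per_s(50, 2.5 * 60)   # about 0.33 MB/s
reference = 40.0                               # MB/s over ssh on gigabit, from above
usb11_ceiling = 12 / 8                         # 12 Mbit/s = 1.5 MB/s (assumed USB 1.1)

print(f"observed: {observed:.2f} MB/s")
print(f"slower than the reference by a factor of {reference / observed:.0f}")
print(f"USB 1.1 theoretical ceiling: {usb11_ceiling:.1f} MB/s")
```

A rate this far below both figures points at the destination device or bus rather than at zfs send itself.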

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



[zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Eric D. Mudama


Hi.  While doing a scan of disk usage, I noticed the following oddity.
I have a directory of files (named file.dat for this example) that all
appear as ~1.5GB when using 'ls -l', but that (correctly) appear as ~250KB
files when using 'ls -s' or du commands:

  edmudama$ ls -l file.dat
  -rwxrwx---+  1 remlab   staff    1447317088 Jun  5  2010 file.dat

  edmudama$ /usr/bin/ls -l file.dat
  -rwxrwx---+  1 remlab   staff    1447317088 Jun  5  2010 file.dat

  edmudama$ ls -ls file.dat
   521 -rwxrwx---+  1 remlab   staff    1447317088 Jun  5  2010 file.dat

  edmudama$ du -sh file.dat
   260K   file.dat

  edmudama$ /usr/bin/du -s file.dat
  521 file.dat

  edmudama$ /usr/bin/ls -s file.dat
   521 file.dat

I am running oi_148, though the files were created likely back when we
were using an older opensolaris (2008.11) on this same machine.  The
results with both gnu ls and solaris ls are identical.  Dedup is not
enabled on any pool, nor has it ever been enabled.

Filesystem is zfs version 4, Pool is ZFS pool version 28.

A scrub of the pool is consistent and shows no errors, and the sizing
reported in 'zpool list' would appear to match the du block counts
from what I can tell.

  edmudama$ zpool list
  NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
  rpool  29.8G  22.0G  7.79G    73%  1.00x  ONLINE  -
  tank   1.81T   879G   977G    47%  1.00x  ONLINE  -

Is something broken?  Any idea why I am seeing the wrong sizes in ls?

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Eric D. Mudama

On Mon, May  2 at 14:01, Bob Friesenhahn wrote:

On Mon, 2 May 2011, Eric D. Mudama wrote:



Hi.  While doing a scan of disk usage, I noticed the following oddity.
I have a directory of files (named file.dat for this example) that all
appear as ~1.5GB when using 'ls -l', but that (correctly) appear as ~250KB
files when using 'ls -s' or du commands:


These are probably just sparse files.  Nothing to be alarmed about.


They were created via CIFS.  I thought sparse files were an iSCSI concept, no?

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Eric D. Mudama

On Mon, May  2 at 20:50, Darren J Moffat wrote:

On 05/ 2/11 08:41 PM, Eric D. Mudama wrote:

On Mon, May 2 at 14:01, Bob Friesenhahn wrote:

On Mon, 2 May 2011, Eric D. Mudama wrote:



Hi. While doing a scan of disk usage, I noticed the following oddity.
I have a directory of files (named file.dat for this example) that all
appear as ~1.5GB when using 'ls -l', but that (correctly) appear as
~250KB
files when using 'ls -s' or du commands:


These are probably just sparse files. Nothing to be alarmed about.


They were created via CIFS. I thought sparse files were an iSCSI
concept, no?


iSCSI is a block level protocol.  Sparse files are a filesystem level 
concept that is understood by many filesystems including CIFS and ZFS 
and many others.


Yea, kept googling and it makes sense.  I guess I am simply surprised
that the application would have done the seek+write combination, since
on NTFS (which doesn't support sparse) these would have been real
1.5GB files, and there would be hundreds or thousands of them in
normal usage.
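
For anyone who wants to see the seek+write effect first hand, here is a minimal sketch that creates a sparse file and compares its apparent and allocated sizes. It assumes a POSIX system (st_blocks is not available on Windows) and a filesystem that supports holes; on one that does not, the allocated size will simply match the apparent size.

```python
import os
import tempfile

def make_sparse_file(path, apparent_size):
    """Seek to the last byte and write it, leaving a hole before it."""
    with open(path, "wb") as f:
        f.seek(apparent_size - 1)
        f.write(b"\0")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.dat")
    make_sparse_file(path, 100 * 1024 * 1024)

    st = os.stat(path)
    apparent = st.st_size            # what 'ls -l' reports
    allocated = st.st_blocks * 512   # what 'ls -s' / du report (512-byte units)
    print(f"apparent {apparent} bytes, allocated {allocated} bytes")
```

An application that preallocates its files this way would produce exactly the ls versus du mismatch described at the start of the thread.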

thx!

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Eric D. Mudama

On Mon, May  2 at 15:30, Brandon High wrote:

On Mon, May 2, 2011 at 1:56 PM, Eric D. Mudama
edmud...@bounceswoosh.org wrote:

that the application would have done the seek+write combination, since
on NTFS (which doesn't support sparse) these would have been real
1.5GB files, and there would be hundreds or thousands of them in
normal usage.


NTFS supports sparse files.
http://www.flexhex.com/docs/articles/sparse-files.phtml



ok corrected, thx.  my google-fu had indicated otherwise, thanks

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] X4540 no next-gen product?

2011-04-09 Thread Eric D. Mudama

On Fri, Apr  8 at 22:03, Erik Trimble wrote:
I want my J4000's back, too.  And, I still want something like HP's 
MSA 70 (25 x 2.5" drive JBOD in a 2U form factor)


Just noticed that SuperMicro is now selling a 4U 72-bay 2.5" 6Gbit/s
SAS chassis, the SC417.  Unclear from the documentation how many
6Gbit/s SAS lanes are connected for that many devices though.  Maybe
that plus a support contract from Sun would be a worthy replacement,
though you definitely won't have a single vendor to contact for
service issues.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Eric D. Mudama

On Fri, Apr  8 at 18:08, Chris Banal wrote:
Can anyone comment on Solaris with zfs on HP systems? Do things work 
reliably? When there is trouble how many hoops does HP make you jump 
through (how painful is it to get a part replaced that isn't flat out 
smokin')? Have you gotten bounced between vendors?


When I was choosing between HP and Dell about two years ago, the HP
RAID adapter wasn't supported out-of-the-box by solaris, while the
Dell T410/610/710 systems were using the Dell SAS-6i/R, which is a
rebranded LSI 1068i-R adapter.  I believe Dell's H200 is basically an
LSI 9211-8i, which also works well.

I can't comment on HP's support, I have no experience with it.  We now
self-support our software (OpenIndiana b148)

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] cannot replace c10t0d0 with c10t0d0: device is too small

2011-03-04 Thread Eric D. Mudama

On Fri, Mar  4 at  9:22, Robert Hartzell wrote:

In 2007 I bought 6 WD1600JS 160GB sata disks and used 4 to create a raidz 
storage pool and then shelved the other two for spares. One of the disks failed 
last night so I shut down the server and replaced it with a spare. When I tried 
to zpool replace the disk I get:

zpool replace tank c10t0d0
cannot replace c10t0d0 with c10t0d0: device is too small

The 4 original disk partition tables look like this:

Current partition table (original):
Total disk sectors available: 312560317 + 16384 (reserved sectors)

Part         Tag  Flag  First Sector       Size  Last Sector
   0         usr    wm            34   149.04GB    312560350
   1  unassigned    wm             0          0            0
   2  unassigned    wm             0          0            0
   3  unassigned    wm             0          0            0
   4  unassigned    wm             0          0            0
   5  unassigned    wm             0          0            0
   6  unassigned    wm             0          0            0
   8    reserved    wm     312560351     8.00MB    312576734

Spare disk partition table looks like this:

Current partition table (original):
Total disk sectors available: 312483549 + 16384 (reserved sectors)

Part         Tag  Flag  First Sector       Size  Last Sector
   0         usr    wm            34   149.00GB    312483582
   1  unassigned    wm             0          0            0
   2  unassigned    wm             0          0            0
   3  unassigned    wm             0          0            0
   4  unassigned    wm             0          0            0
   5  unassigned    wm             0          0            0
   6  unassigned    wm             0          0            0
   8    reserved    wm     312483583     8.00MB    312499966

So it seems that two of the disks are slightly different models and are about
40MB smaller than the original disks.
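
The size gap can be worked out directly from the two "Total disk sectors available" lines. This sketch assumes 512-byte sectors.

```python
SECTOR_BYTES = 512               # assumed sector size

original_sectors = 312560317     # from the original disks' partition table
spare_sectors = 312483549        # from the spare disk's partition table

shortfall_sectors = original_sectors - spare_sectors
shortfall_bytes = shortfall_sectors * SECTOR_BYTES

print(f"{shortfall_sectors} sectors short")
print(f"{shortfall_bytes / 10**6:.1f} MB ({shortfall_bytes / 2**20:.1f} MiB)")
```

That is a difference of about 0.025%, but zpool replace still refuses it because the new device is smaller than the one it replaces.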



One comment: The IDEMA LBA01 spec size of a 160GB device is
312,581,808 sectors.

Instead of those WD models, where neither the old nor new drives
follow the IDEMA recommendation, consider buying a drive that reports
that many sectors.  Almost all models these days should be following
the IDEMA recommendations due to all the troubles people have had.
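
For reference, the IDEMA sector count follows a simple linear formula in the advertised capacity. The sketch below uses the constants as I recall them from the IDEMA LBA count standard; treat them as an assumption to be checked against the document itself, though they do reproduce the 160GB figure above.

```python
def idema_lba_count(capacity_gb):
    """LBA count for a drive of the given advertised capacity (512-byte sectors)."""
    return 97_696_368 + 1_953_504 * (capacity_gb - 50)

expected = idema_lba_count(160)
print(f"160GB -> {expected:,} sectors")

# Usable sectors reported for the disks in this thread, for comparison
for label, sectors in (("original", 312560317), ("spare", 312483549)):
    print(f"{label}: {expected - sectors:,} sectors below the IDEMA count")
```

Drives that follow the formula report identical sector counts across vendors and models, which avoids the "device is too small" problem when mixing replacements.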

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Good SLOG devices?

2011-03-02 Thread Eric D. Mudama

On Wed, Mar  2 at  9:58, David Dyer-Bennet wrote:


On Tue, March 1, 2011 10:35, Garrett D'Amore wrote:


a) do you need an SLOG at all?  Some workloads (asynchronous ones) will
never benefit from an SLOG.


I've been fighting the urge to maybe do something about ZIL (which is what
we're talking about here, right?).  My load is CIFS, not NFS (so not
synchronous, right?), but there are a couple of areas that are significant
to me where I do decent-size (100MB to 1GB) sequential writes (to
newly-created files).  On the other hand, when those writes seem to me to
be going slowly, the disk access lights aren't mostly on, suggesting that
the disk may not be what's holding me up.  I can test that by saving to
local disk and comparing times, also maybe running zpool iostat.

This is a home system, lightly used; the performance issue is me sitting
waiting while big Photoshop files save.  So of some interest to me
personally, and not at ALL like what performance issues on NAS usually
look like.  It's on a UPS, so I'm not terribly worried about losses on
power failure; and I'd just lose my work since the last save, generally,
at worst.

I might not believe the disk access lights on the box (Chenbro chassis,
with two 4-drive hot-swap bays for the data disks; driven off the
motherboard  SATA plus a Supermicro 8-port SAS controller with SAS-to-SATA
cables).  In doing a drive upgrade just recently, I got rather confusing
results with the lights, perhaps the controller or the drive model made a
difference in when the activity lights came on.

The VDEVs in the pool are mirror pairs.  It's been expanded twice by
adding VDEVs and once by replacing devices in one VDEV.  So the load is
probably fairly unevenly spread across them just now.  My desktop connects
to this server over gigabit ethernet (through one switch; the boxes sit
next to each other on a shelf over my desk).

I'll do more research before spending money.  But as a question of general
theory, should a decent separate intent log device help for a single-user
sequential write sequence in the 100MB to 1GB size range?


ZIL, as I understand it, is only for small synchronous writes, the
opposite of your workload.  If you don't have a SLOG, the ZIL is
embedded in your pool anyway.  Above a certain size, the writes go
straight to the pool's final storage location.

I'd be curious whether you're getting errors in your SMB stream, or maybe
your server is set to hold onto too much data before flushing (default
is 45 seconds, and there have been reports of the system not always
force-flushing early when the buffers fill).  I've heard reports of
short-stroking the amount of time it accumulates write data resulting
in improved performance in some workloads.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ZFS Performance

2011-02-27 Thread Eric D. Mudama

On Mon, Feb 28 at  0:30, Toby Thain wrote:

I would expect COW puts more pressure on near-full behaviour compared to
write-in-place filesystems. If that's not true, somebody correct me.


Off the top of my head, I think it'd depend on the workload.

Write-in-place will always be faster with large IOs than with smaller
IOs, and write-in-place will always be faster than CoW with large
enough IO because there's no overhead for choosing where the write
goes (and with large enough IO, seek overhead ~= 0)

With CoW, it probably matters more what the previous version of the
LBAs you're overwriting looked like, plus how fragmented the free
space is.  Into a device with plenty of free space, small writes
should be significantly faster than write-in-place.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] SIL3114 and sparc solaris 10

2011-02-25 Thread Eric D. Mudama

On Fri, Feb 25 at 22:29, Nathan Kroenert wrote:
I don't recall if Solaris 10 (Sparc or X86) actually has the si3124 
driver, but if it does, for a cheap thrill, they are worth a bash. I 
have no problems pushing 4 disks pretty much flat out on a PCI-X 133 
3124 based card. (note that there was a pci and a pci-x version of 
the 3124, so watch out.)


Most 3124s I've seen are natively PCI-X, but they work fine in PCI
slots, albeit with less bandwidth available.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] SIL3114 and sparc solaris 10

2011-02-23 Thread Eric D. Mudama

On Wed, Feb 23 at 13:29, Andrew Gabriel wrote:

Mauricio Tavares wrote:
Perhaps a bit off-topic (I asked on the rescue list -- 
http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and 
was told to try here), but I am kinda shooting in the dark: I have 
been finding online scattered and vague info stating that this card 
can be made to work with a sparc solaris 10 box (http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html 
is the only link I can offer right now). Can anyone confirm or deny 
that?


3112/3114 was a very early (possibly the first?) SATA chipset, I 
think aimed for use before SATA drivers had been developed. I would 
suggest looking for something more modern.


Not only that, the 3112 would do non-sector-aligned FIS transfers for
writes  15 sectors, which caused all sorts of trouble for the
firmware developers at the disk companies, resulting in numerous
reports of compatibility and performance problems with 3112/3114
hardware.

I +1 the suggestion to find something more modern if at all possible.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] SIL3114 and sparc solaris 10

2011-02-23 Thread Eric D. Mudama

On Wed, Feb 23 at 13:16, Mauricio Tavares wrote:

On Wed, Feb 23, 2011 at 12:53 PM, Eric D. Mudama
edmud...@bounceswoosh.org wrote:

On Wed, Feb 23 at 13:29, Andrew Gabriel wrote:


Mauricio Tavares wrote:


Perhaps a bit off-topic (I asked on the rescue list --
http://web.archiveorange.com/archive/v/OaDWVGdLhxWVWIEabz4F -- and was told
to try here), but I am kinda shooting in the dark: I have been finding
online scattered and vague info stating that this card can be made to work
with a sparc solaris 10 box
(http://old.nabble.com/eSATA-or-firewire-in-Solaris-Sparc-system-td27150246.html
is the only link I can offer right now). Can anyone confirm or deny that?


3112/3114 was a very early (possibly the first?) SATA chipset, I think
aimed for use before SATA drivers had been developed. I would suggest
looking for something more modern.


Not only that, the 3112 would do non-sector-aligned FIS transfers for
writes  15 sectors, which caused all sorts of trouble for the
firmware developers at the disk companies, resulting in numerous
reports of compatibility and performance problems with 3112/3114
hardware.

I +1 the suggestion to find something more modern if at all possible.


Oh, just lovely. What would you suggest instead? I mean, besides
canning the machine altogether ;)


Since SAS adapters can talk to SATA disks, I'd just use a compatible
SAS adapter.  It'll cost a few hundred dollars, but likely save you a
lot of effort and frustration.

There are a number of vendors of PCI, PCI-X and PCI-e SAS adapters for
SPARC hardware, based on a quick google.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Incremental send/recv interoperability

2011-02-15 Thread Eric D. Mudama

On Tue, Feb 15 at 11:18, Erik ABLESON wrote:

Just wondering if an expert can chime in on this one.

I have an older machine running 2009.11 with a zpool at version
14. I have a new machine running Solaris Express 11 with the zpool
at version 31.

I can use zfs send/recv to send a filesystem from the older machine
to the new one without any difficulties. However, as soon as I try
to update the remote copy with an incremental send/recv I get back
the error "cannot receive incremental stream: invalid backup
stream".

I was under the impression that the streams were backwards
compatible (ie a newer version could receive older streams) which
appears to be correct for the initial send/recv operation, but
failing on the incremental.


Sounds like you may need to force an older pool version on the
destination machine to use it in this fashion, since it's adding data
to a stream that has been converted to use the new pool when you recv
it.

I could be wrong, though; we update our pools in lockstep and err on
the side of backwards compatibility with our multi-system backup.


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ZFS and TRIM - No need for TRIM

2011-02-07 Thread Eric D. Mudama

On Mon, Feb  7 at 20:43, Bob Friesenhahn wrote:

On Sun, 6 Feb 2011, Orvar Korvar wrote:


1) Using an SSD without TRIM is acceptable. The only drawback is that without
TRIM, the SSD will write much more, which affects its lifetime, because once
the SSD has written enough, it will break.


Why do you think that the SSD should necessarily write much more?  I 
don't follow that conclusion.


If I can figure out how to design a SSD which does not necessarily 
write much more, I suspect that an actual SSD designer can do the 
same.


Blocks/sectors marked as being TRIM'd do not need to be maintained by
the garbage collection engine.  Depending on the design of the SSD,
this can significantly reduce the write amplification of the SSD.
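
A toy model of why this matters (deliberately simplified, and not a
description of any particular SSD): suppose the garbage collector
reclaims erase blocks in which a fraction `valid` of the pages still
holds data the drive believes is live.  Those pages have to be copied
elsewhere before the erase, so the NAND sees 1/(1 - valid) page writes
for every page the host writes.  TRIM lowers `valid` by telling the
drive which pages are dead.

```python
def write_amplification(valid):
    """NAND page writes per host page write when GC victims are `valid` full."""
    if not 0 <= valid < 1:
        raise ValueError("valid fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid)

# illustrative fractions only, not measurements
for valid in (0.9, 0.5, 0.1):
    print(valid, round(write_amplification(valid), 2))
```

In this model a drive that has to treat 90% of each victim block as
live writes ten pages for every host page, while one that knows half
the block is dead writes two.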


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Question regarding zfs snapshot -r and also regarding zfs send -R

2011-02-02 Thread Eric D. Mudama

On Wed, Feb  2 at 17:03, Rahul Deb wrote:

Does zfs send -R send all the snapshots at once, OR does it send
the descendent snapshots serially (one after another)?  I am
asking because, if it sends serially, send/recv will take a
long time to finish, depending on the number of snapshots that need
to be sent, and this duration will gradually increase if the number
of snapshots keeps increasing. Is my assumption correct?
Thanks,
-- Rahul


We do not use send|recv for descendent filesystems, so I don't have
personal experience, but I am pretty sure the send is an
all-or-nothing operation which runs serially.  Our setup sustains
about 40MB/s on our send|recv using SSL to a backup server with
1500MTU.  That 40MB/s is measured with the system under concurrent
server load, so I don't think we're doing that poorly.

Parallelization of the send|recv would only net a benefit if a serial
approach was unable to saturate the network interface, but I've yet to
witness that case.  Some setups will forgo ssl and instead use one of
the network pipe interfaces that isn't encrypted, which can improve
performance when CPU bound, but we haven't tried to set up anything
like that since we haven't needed it.

Using -R by itself will result in a larger send|recv each time, as you
suspect, unless you also use -I/-i to do incrementals.  In the -I/-i
case, the size of the incremental is relative to the amount of new
data since the last time you sent incrementals, similar to rsync or a
half dozen other techniques.

At work we always use -i, and our send|recv is anywhere from 5-20
minutes, depending on what data was added or modified.
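
The difference between the full and incremental forms is easiest to see
in the command lines themselves.  A minimal sketch that only builds the
argument lists (it does not run anything); the helper name and the
pool/snapshot names are made up for the example:

```python
def zfs_send_cmd(new_snap, prev_snap=None, recursive=False, intermediates=False):
    """Build a `zfs send` argument list from caller-supplied snapshot names."""
    cmd = ["zfs", "send"]
    if recursive:
        cmd.append("-R")
    if prev_snap is not None:
        # -I includes every snapshot between prev and new; -i sends one delta
        cmd += ["-I" if intermediates else "-i", prev_snap]
    cmd.append(new_snap)
    return cmd

# hypothetical names, for illustration only
full = zfs_send_cmd("tank@tuesday", recursive=True)
incr = zfs_send_cmd("tank@tuesday", prev_snap="tank@monday",
                    recursive=True, intermediates=True)
print(" ".join(full))
print(" ".join(incr))
```

The first form resends everything each time; the second only carries
what changed since the earlier snapshot, which is why its duration
tracks the amount of new data rather than the number of snapshots.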

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread Eric D. Mudama

On Wed, Feb  2 at 20:40, Edward Ned Harvey wrote:

Wouldn't multiple platters of the same density still produce a throughput
that's a multiple of what it would have been with a single platter?  I'm
assuming the heads on the multiple platters are all able to operate
simultaneously.


Nope.  Most HDDs today have a single read channel, and they select
which head uses that channel at any point in time.  They cannot use
multiple heads at the same time, because the heads do not travel the
same path on their respective surfaces at the same time.  There's no
real vertical alignment of the tracks between surfaces, and every
surface has its own embedded position information that is used when
that surface's head is active.  There were attempts at multi-actuator
designs with separate servo arms and multiple channels, but
mechanically they're too difficult to manufacture at high yields as I
understood it.

http://www.tomshardware.com/news/seagate-hdd-harddrive,8279.html



--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread Eric D. Mudama

On Wed, Feb  2 at 20:45, Edward Ned Harvey wrote:

For sustained throughput, I don't measure in IOPS.  I measure in MB/s, or
Mbit/s.  For a slow hard drive, 500Mbit/s.  For a fast one, 1 Gbit/s or
higher.  I was surprised by the specs of the seagate disks I just emailed a
moment ago.  1Gbit out of a 7.2krpm drive...  That's what I normally expect
out of a 15krpm drive.


It used to be that enterprise grade, higher RPM devices used more
expensive electronics, but that's not really the case anymore.  It
seems most vendors are trying to use common electronics across their
product lines, which generally makes great business sense.

These days I think most HDD companies get their channel working at a
certain max bitrate, and format their drive zones to match that
bitrate at the max radius where velocity is the highest.  This is a
bit of a simplification, but it's the general idea.

When the drive is spinning the media less quickly, in a 7200 RPM
device, they can pack the bits in more tightly, which lowers overall
cost because they need fewer heads and platters to achieve a target
capacity.  It just so happens that the max bits/second flying under
the read head is a constant pegged to the channel design.  All other
things being equal, the 15k and the 7200 drive, which share
electronics, will have the same max transfer rate at the OD.
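
To put rough numbers on that: if the channel tops out at some fixed bit
rate, the number of bits that fit on the outermost track is that rate
divided by the revolutions per second.  A sketch using a made-up
2 Gbit/s channel limit purely for illustration, and ignoring any
difference in platter diameter between the two drives:

```python
CHANNEL_MAX_BPS = 2_000_000_000  # hypothetical channel limit, bits per second

def max_bits_per_od_track(rpm, channel_bps=CHANNEL_MAX_BPS):
    """Most bits one outer-diameter revolution can hold within the channel limit."""
    revs_per_second = rpm / 60
    return channel_bps / revs_per_second

for rpm in (7200, 15000):
    mbit = max_bits_per_od_track(rpm) / 1e6
    print(rpm, "RPM:", round(mbit, 2), "Mbit per OD track")
```

Both drives peak at the same bits per second, but the slower spindle
gets to lay down roughly twice as many bits per revolution.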


I know people sometimes (often) use IOPS even when talking about sequential
operations, but I only say IOPS for random operations.


Me too, though not everyone realizes how much overhead there can be in
small operations, even sequential ones.

--eric


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] fmadm faulty not showing faulty/offline disks?

2011-02-02 Thread Eric D. Mudama

On Wed, Feb  2 at 21:05, Krunal Desai wrote:

So build the current version of smartmontools. As you should have seen in my 
original response, I'm using 5.40. Bugs in 5.36 are unlikely to be interesting 
to the maintainers of the package ;-)


Oops, missed that in your log. Will try compiling from source and see what 
happens.

Also, recently it seems like all the links to tools I need are broken. Where 
can I find a lsiutil binary for Solaris?


If you search for 'lsiutil solaris' on lsi.com, it'll direct you to a
zipfile that includes a solaris binary for x86 solaris.

At home now so can't test it.

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)

2011-02-02 Thread Eric D. Mudama

On Thu, Feb  3 at 14:18, taemun wrote:

  Uhm. Higher RPM = higher linear speed of the head above the platter =
  higher throughput. If the bit pitch (ie the size of each bit on the
  platter) is the same, then surely a higher linear speed corresponds with a
  larger number of bits per second?
  So if all other things being equal includes the bit density, and radius
  to the edge of the media, then ... surely higher rpm = higher throughput?
  Cheers,


Point being that they have to lower the bit density on high RPM drives
to fit within the bandwidth constraints of the channel.

If they could just get their channel working at 3GHz instead of 2GHz
or whatever, they'd use that capability to pack even more bits into
the consumer drives to lower costs.


--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Question regarding zfs snapshot -r

2011-02-01 Thread Eric D. Mudama

On Tue, Feb  1 at 10:54, Rahul Deb wrote:

  Hello All,
  I have two questions related to zfs snapshot -r
  1. When the zfs snapshot -r tank@today command is issued, does it create
  snapshots for all the descendent file systems at the same moment? I mean,
  if the command is issued at 10:20:35 PM, is the creation time of
  all the snapshots for descendent file systems the same?
  2. Say tank has around 5000 descendent file systems and zfs snapshot -r
  tank@today takes around 10 seconds to complete. If a new file
  system is created under tank within that 10-second period, does the
  snapshot process include the new file system created within those 10
  seconds?
  OR will it exclude that newly created filesystem?
  Thanks,
  -- Rahul


I believe the contract is that the content of all recursive snapshots
are consistent with the instant in time at which the snapshot command
was executed.

Quoting from the ZFS Administration Guide:

 Recursive ZFS snapshots are created quickly as one atomic
 operation. The snapshots are created together (all at once) or not
 created at all. The benefit of such an operation is that the snapshot
 data is always taken at one consistent time, even across descendent
 file systems.

Therefore, in #2 above, the snapshot wouldn't include the new file in
the descendent file system, because it was created after the moment in
time when the snapshot was initiated.

In #1 above, I would guess the snapshot time is the time of the
initial command across all filesystems in the tree, even if it takes
10 seconds to actually complete the command.  However, I have no such
system where I can prove this guess as correct or not.
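
One way to check the guess in #1 on a test pool would be to compare the
`creation` property of every snapshot that shares the name.  A sketch,
assuming tab-separated name/value lines with epoch seconds, as
something like `zfs get -r -H -p -o name,value creation tank` should
produce; the sample text below is invented, not captured from a real
system:

```python
def creation_times(zfs_get_output, snap_name):
    """Map dataset@snap_name to its creation epoch, from name<TAB>value lines."""
    times = {}
    for line in zfs_get_output.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t")
        if name.endswith("@" + snap_name):
            times[name] = int(value)
    return times

# invented sample output, for illustration only
sample = (
    "tank\t1296000000\n"
    "tank@today\t1296584435\n"
    "tank/home\t1296000100\n"
    "tank/home@today\t1296584435\n"
)

times = creation_times(sample, "today")
spread = max(times.values()) - min(times.values())
print(len(times), "snapshots, creation spread", spread, "s")
```

A spread of zero (or close to it) across thousands of descendents
would support the guess; a spread close to the command's run time
would contradict it.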

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache

2011-01-28 Thread Eric D. Mudama

On Fri, Jan 28 at  8:25, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Eff Norwood

We tried all combinations of OCZ SSDs including their PCI based SSDs and
they do NOT work as a ZIL. After a very short time performance degrades
horribly and for the OCZ drives they eventually fail completely.


This was something interesting I found recently.  Apparently for flash
manufacturers, flash hard drives are like the pimple on the butt of the
elephant. A vast majority of the flash production in the world goes into
devices like smartphones, cameras, tablets, etc.  Only a slim minority goes
into hard drives.  As a result, they optimize for these other devices, and
one of the important side effects is that standard flash chips use an 8K
page size.  But hard drives use either 4K or 512B.

The SSD controller secretly remaps blocks internally, and aggregates small
writes into a single 8K write, so there's really no way for the OS to know
if it's writing to a 4K block which happens to be shared with another 4K
block in the 8K page.  So it's unavoidable, and whenever it happens, the
drive can't simply write.  It must read modify write, which is obviously
much slower.


The reality is way more complicated, and statements like the above may
or may not be true on a vendor-by-vendor basis.

As time passes, the underlying NAND geometries are designed for
certain sets of advantages, continually subject to re-evaluation and
modification, and good SSD controllers on the top of NAND or other
solid-state storage will map those advantages effectively into our
problem domains as users.

Testing methodologies are improving over time as well, and eventually
it will be more clear which devices are suited to which tasks.

The suitability of a specific solution into a problem space will
always be a balance between cost, performance, reliability and time to
market.  No single solution (RAM SAN, RAM SSD, NAND SSD, BBU
controllers, rotating HDD, etc.) wins in every single area, or else we
wouldn't be having this discussion.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] reliable, enterprise worthy JBODs?

2011-01-25 Thread Eric D. Mudama

On Tue, Jan 25 at 21:28, Peter Tribble wrote:

On Tue, Jan 25, 2011 at 8:50 PM, Lasse Osterild lass...@unixzone.dk wrote:


On 25/01/2011, at 19.04, Philip Brown wrote:


Any other suggestions for (large-)enterprise-grade, supported JBOD hardware for 
ZFS these days?
Either fibre or SAS would be okay.
--


I'd go with some Dell MD1200's, for us they ended up being cheaper (incl disks) 
than a SuperMicro case with the same model disks, and it's way nicer than the 
low-quality SuperMicro stuff.
It works perfectly fine with LSI 9200-8e SAS2 controllers under Solaris.   The 
SuperMicro boxes won't do multi-pathing to the same LSI 9200-8e controller.


As a matter of interest, what sort of system configurations are you
building? Are you daisy-chaining JBOD units? I like the overall idea,
but with direct attach and multi-pathing that's one HBA (and one slot


Unless I'm wrong, I think you only need 2 adapters per stack of JBODs.

With adapters A0 and A1, and JBODs J0 through J3, you get:

A0 - J0 - J1 - J2 - J3
A1 - J3 - J2 - J1 - J0

Yes, all the above are daisy-chained, starting at a different side of
the stack with each adapter.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-09 Thread Eric D. Mudama

On Sun, Jan  9 at 22:54, Peter Taps wrote:

Thank you all for your help. I am the OP.

I haven't looked at the link that talks about the probability of
collision. Intuitively, I still wonder how the chances of collision
can be so low. We are reducing a 4K block to just 256 bits. If the
chances of collision are so low, *theoretically* it is possible to
reconstruct the original block from the 256-bit signature by using a
simple lookup. Essentially, we would now have world's best
compression algorithm irrespective of whether the data is text or
binary. This is hard to digest.

Peter


"Simple lookup" isn't so simple when there are 2^256 records to
search; however, fundamentally your understanding of hashes is
correct.

That being said, while at some point people might identify two
commonly-used blocks with the same hash (e.g. system library files or
other) the odds of it happening are extremely low.  Random
google-result website calculates you as needing ~45 exabytes in your
pool of 4KB chunk deduped data before you get to a ~10^-17 chance of a
hash collision:

http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html

Now, obviously the above is in the context of having to restore from
backup, which is rare, however in live usage I don't think the math
changes a whole lot.
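
A back-of-the-envelope check using the usual birthday approximation,
p ~= n(n-1)/2 / 2^bits for n distinct blocks.  This is a sketch, taking
an exabyte as 10^18 bytes; which hash width the linked page assumed is
not stated here, so both 160 and 256 bits are shown:

```python
def collision_probability(n_blocks, hash_bits):
    """Birthday approximation for any collision; only valid while p is tiny."""
    return n_blocks * (n_blocks - 1) / 2 / 2.0**hash_bits

POOL_BYTES = 45 * 10**18   # 45 exabytes, assuming 1 EB = 10**18 bytes
BLOCK_BYTES = 4096         # 4KB chunks
n = POOL_BYTES // BLOCK_BYTES

p160 = collision_probability(n, 160)
p256 = collision_probability(n, 256)
print(f"160-bit hash: {p160:.1e}")
print(f"256-bit hash: {p256:.1e}")
```

The 160-bit result lands in the 10^-17 neighbourhood quoted above; with
a 256-bit hash such as sha256 the same pool comes out many orders of
magnitude lower still.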



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] How to increase iometer reading?

2011-01-09 Thread Eric D. Mudama

On Sun, Jan  9 at 23:07, Peter Taps wrote:

Hello,

We are building a zfs-based storage system with generic but
high-quality components. We would like to test the new system under
various loads. If we find that the iometer reading has started to
reduce under certain loads, I am wondering what performance counters
we should look for to identify the bottlenecks. Don't want to
replace a component just to find that there was no improvement in
iometer reading.


fsstat zfs 1
zpool iostat 1

any suggestions beyond that will require a lot more detail on your
setup and target workload

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-23 Thread Eric D. Mudama

On Wed, Dec 22 at 23:29, Christopher George wrote:

Would having to perform a Secure Erase every hour, day, or even
week really be the most cost effective use of an administrators time?


You're assuming that the "into an empty device" performance is
required by their application.

For many users, the worst-case steady-state of the device (6k IOPS on
the Vertex2 EX, depending on workload, as per slide 48 in your
presentation) is so much faster than a rotating drive (50x faster,
assuming that cache disabled on a rotating drive is roughly 100 IOPS
with queueing), that it'll still provide a huge performance boost when
used as a ZIL in their system.

For a huge ZFS box providing tens of ZFS filesystems in a pool all
with huge user loads, sure, a RAM based device makes sense, but it's
overkill for some large percentage of ZFS users, I imagine.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-23 Thread Eric D. Mudama

On Thu, Dec 23 at  9:14, Erik Trimble wrote:
The longer-term solution is to have SSDs change how they are 
designed, moving away from the current one-page-of-multiple-blocks as 
the atomic entity of writing, and straight to a one-block-per-page 
setup.  Don't hold your breath.


Will never happen using NAND technology.

Non-NAND SSDs may or may not have similar or related limitations.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-23 Thread Eric D. Mudama

On Thu, Dec 23 at 17:11, Deano wrote:

Currently firmware is meant to help conventional file system usage. However
ZIL isn't normal usage and as such *IF* and it's a big if, we can
effectively bypass the firmware trying to be clever or at least help it be
clever then we can avoid the downgrade over time. In particular if we could
secure erase a few cells at once as required, the lifetime would be much
longer. I'd even argue that taking the wear leveling out of the drive's hands
would be useful in the ZIL case.


In most cases, an SSD knows something isn't valuable when it is
overwritten.  If the allocator for the ZIL would rewrite to sectors
no-longer-needed, instead of walking sequentially across the entire
available LBA space, slowdown of a ZIL would likely never occur on a
NAND SSD, since the drive would always have a good idea which sectors
were free and which were still in use.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-23 Thread Eric D. Mudama

On Thu, Dec 23 at 11:25, Stephan Budach wrote:

  Hi,

  as I have learned from the discussion about which SSD to use as ZIL
  drives, I stumbled across this article, that discusses short stroking for
  increasing IOPs on SAS and SATA drives:

  [1]http://www.tomshardware.com/reviews/short-stroking-hdd,2157.html

  Now, I am wondering if using a mirror of such 15k SAS drives would be a
  good-enough fit for a ZIL on a zpool that is mainly used for file services
  via AFP and SMB.
  I'd particularly like to know if someone has already used such a solution
  and how it has worked out.


Haven't personally used it, but the worst case steady-state IOPS of
the Vertex2 EX, from the DDRDrive presentation, is 6k IOPS assuming a
full-pack random workload.

To achieve that through SAS disks in the same workload, you'll
probably spend significantly more money and it will consume a LOT more
space and power.

According to that Tom's article, a typical 15k SAS enterprise drive is
in the 600 IOPS ballpark when short-stroked and consumes about 15W
active.  Thus you're going to need ten of these devices, to equal the
degraded steady-state IOPS of an SSD.  I just don't think the math
works out.  At that point, you're probably better-off not having a
dedicated ZIL, instead of burning 10 slots and 150W.
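
The arithmetic behind that, spelled out.  The figures are the ones
quoted above; the simplifying assumption is that IOPS add up linearly
across spindles:

```python
import math

SSD_STEADY_IOPS = 6000   # Vertex2 EX worst case, per the DDRDrive presentation
SAS_15K_IOPS = 600       # short-stroked 15k SAS drive, per the Tom's article
SAS_15K_WATTS = 15       # active power per drive, per the same article

drives_needed = math.ceil(SSD_STEADY_IOPS / SAS_15K_IOPS)
total_watts = drives_needed * SAS_15K_WATTS
print(drives_needed, "drives,", total_watts, "W")
```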

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-23 Thread Eric D. Mudama

On Thu, Dec 23 at 10:49, Christopher George wrote:

My assumption was stated in the paragraph prior, i.e. vendor promised
random write IOPS.  Based on the inquiries we receive, most *actually*
expect an OCZ SSD to perform as specified which is 50K 4KB
random writes for both the Vertex 2 EX and the Vertex 2 Pro.


Okay, I understand where you're coming from.

Yes, buyers must be aware of the test methodologies for published
benchmark results, especially those used to sell drives by the vendors
themselves.  "Up to" is generally a poor thing on which to base a buying
decision.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] A few questions

2010-12-21 Thread Eric D. Mudama

On Tue, Dec 21 at  8:24, Edward Ned Harvey wrote:

From: edmud...@mail.bounceswoosh.org
[mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama

On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:
If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.

Very little is impossible.

Drives have been optimally ordering seeks for 35+ years.  I'm guessing


Unless your drive is able to queue up a request to read every single used
part of the drive...  Which is larger than the command queue for any
reasonable drive in the world...  The point is, in order to be optimal you
have to eliminate all those seeks, and perform sequential reads only.  The
only seeks you should do are to skip over unused space.


I don't think you read my whole post.  I was saying this seek
calculation pre-processing would have to be done by the host server,
and while not impossible, is not trivial.  Present the next 32 seeks
to each device while the pre-processor works on the complete list of
future seeks, and the drive will do as well as possible.
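
As a very rough illustration of the "next 32 seeks" idea, and not how
ZFS actually orders resilver or scrub I/O: sort one device's block
addresses and hand them out a queue's worth at a time.  The up-front
sort stands in for the pre-processing step, which is the expensive part
at real scale; the addresses below are made up:

```python
from itertools import islice

QUEUE_DEPTH = 32  # the "next 32 seeks" per device

def batches_in_lba_order(block_lbas, depth=QUEUE_DEPTH):
    """Yield lists of up to `depth` LBAs, in ascending LBA order."""
    ordered = iter(sorted(block_lbas))
    while True:
        batch = list(islice(ordered, depth))
        if not batch:
            return
        yield batch

# made-up block addresses standing in for one device's used blocks
lbas = [(i * 7919) % 100_003 for i in range(100)]
batches = list(batches_in_lba_order(lbas))
for batch in batches:
    print(len(batch), "commands, LBA", batch[0], "to", batch[-1])
```

Within each batch the drive is still free to reorder, which is the
part it has always been good at.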


If you're able to sequentially read the whole drive, skipping all the unused
space, then you're guaranteed to complete faster (or equal) than either (a)
sequentially reading the whole drive, or (b) seeking all over the drive to
read the used parts in random order.


Yes, I understand how that works.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] A few questions

2010-12-20 Thread Eric D. Mudama

On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:

If there is no correlation between on-disk order of blocks for different
disks within the same vdev, then all hope is lost; it's essentially
impossible to optimize the resilver/scrub order unless the on-disk order of
multiple disks is highly correlated or equal by definition.


Very little is impossible.

Drives have been optimally ordering seeks for 35+ years.  I'm guessing
that the trick (difficult, but not impossible) is how to solve a
travelling salesman route pathing problem where you have billions or
trillions of transactions, and do it fast enough that it was worth
doing any extra computation besides just giving the device 32+ queued
commands at a time that align with the elements of each ordered
transaction ID.

Add to that all the complexity of unwinding the error recovery in the
event that you fail checksum validation on transaction N-1 after
moving past transaction N, which would be a required capability if you
wanted to queue more than a single transaction for verification at a
time.

Oh, and do all of the above without noticeably affecting the throughput
of the applications already running on the system.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] iops...

2010-12-06 Thread Eric D. Mudama

On Mon, Dec  6 at 23:22, Roy Sigurd Karlsbakk wrote:

Hi all

The numbers I've heard say the number of iops for a raidzn volume
should be about the number of iops for the slowest drive in the
set. While this might sound like a good base point, I tend to
disagree. I've been doing some testing on some raidz2 volumes of
various sizes and with varying numbers of VDEVs. It seems, with
iozone, the number of iops is rather high per drive, up to 250 for
these 7k2 drives, even with an 8-drive RAIDz2 VDEV. The testing has
not utilized a high number of threads (yet), but still, it looks like
for most systems, RAIDzN performance should be quite decent.


I think that statement is meant to describe random IO.  I doubt anyone
is getting 250 IOPS out of a 7200rpm drive, unless it has been
significantly short-stroked or is using a very deep SCSI queue depth.
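
A back-of-the-envelope version of that doubt, as a sketch only: with one
command outstanding, each random IO costs roughly an average seek plus half
a revolution.  The seek times below are assumed round numbers, not
measurements from any particular drive.

```python
def estimated_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random-IO rate for one outstanding command on one drive."""
    # On average the target sector is half a revolution away.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


# 8.5 ms is an assumed full-stroke average seek time.
full_stroke = estimated_random_iops(7200, 8.5)
# 1.5 ms is an assumed average seek for a heavily short-stroked drive.
short_stroked = estimated_random_iops(7200, 1.5)

print(f"full stroke:   {full_stroke:.0f} IOPS")
print(f"short-stroked: {short_stroked:.0f} IOPS")
```

Deep queues help because the drive can pick the nearest pending command,
which this simple model does not capture.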

Sequential IO should be fast in any configuration with lots of drives.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Running on Dell hardware?

2010-10-22 Thread Eric D. Mudama

On Wed, Oct 13 at 15:44, Edward Ned Harvey wrote:

From: Henrik Johansen [mailto:hen...@scannet.dk]

The 10g models are stable - especially the R905's are real workhorses.


You would generally consider all your machines stable now?
Can you easily pdsh to all those machines?

kstat | grep current_cstate ; kstat | grep supported_max_cstates


Dell T610, machine has been stable since we got it (relative to the
failure modes you've mentioned)

current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1


--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Running on Dell hardware?

2010-10-13 Thread Eric D. Mudama

On Wed, Oct 13 at 10:13, Edward Ned Harvey wrote:

  I have a Dell R710 which has been flaky for some time.  It crashes about
  once per week.  I have literally replaced every piece of hardware in it,
  and reinstalled Sol 10u9 fresh and clean.



  I am wondering if other people out there are using Dell hardware, with
  what degree of success, and in what configuration?


Dell T610 w/ the default SAS 6/IR adapter has been working fine for us
for 18 months.  All issues have been software bugs in opensolaris so
far.

Not much of a data point, but I have no reason not to buy another Dell
server in the future.

Out of curiosity, did you run into this:
http://blogs.everycity.co.uk/alasdair/2010/06/broadcom-nics-dropping-out-on-solaris-10/

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Eric D. Mudama

On Wed, Oct  6 at 22:04, Edward Ned Harvey wrote:

* Because ZFS automatically buffers writes in ram in order to
aggregate as previously mentioned, the hardware WB cache is not
beneficial.  There is one exception.  If you are doing sync writes
to spindle disks, and you don't have a dedicated log device, then
the WB cache will benefit you, approx half as much as you would
benefit by adding dedicated log device.  The sync write sort-of
by-passes the ram buffer, and that's the reason why the WB is able
to do some good in the case of sync writes.


All of your comments made sense except for this one.

Every N seconds when the system decides to burst writes to media from
RAM, those writes are only sequential in the case where the underlying
storage devices are significantly empty.

Once you're in a situation where your allocations are scattered across
the disk due to longer-term fragmentation, I don't see any way that a
write cache would hurt performance on the devices, since it'd allow
the drive to reorder writes to the media within that burst of data.

Even though ZFS is issuing writes of ~256 sectors if it can, that is
only a fraction of a revolution on a modern drive, so random writes of
128KB still have significant opportunity for reordering optimization.
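
To put a number on "a fraction of a revolution", here is a small sketch.
The 100 MB/s sustained media rate is an assumed figure for illustration;
real drives vary by model and by zone.

```python
def fraction_of_revolution(io_bytes: int, rpm: float,
                           media_rate_bytes_per_s: float) -> float:
    """How much of one platter revolution a sequential transfer occupies."""
    seconds_per_rev = 60.0 / rpm
    bytes_per_rev = media_rate_bytes_per_s * seconds_per_rev
    return io_bytes / bytes_per_rev


io_bytes = 256 * 512  # 256 sectors of 512 bytes = 128KB
frac = fraction_of_revolution(io_bytes, rpm=7200, media_rate_bytes_per_s=100e6)
print(f"{io_bytes} bytes is about {frac:.2f} of a revolution")
```

Under that assumption the transfer itself is short compared with the seek
and rotational wait around it, which is where reordering can win time back.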

Granted, with NCQ or TCQ you can get back much of the cache-disabled
performance loss, however, in any system that implements an internal
queue depth greater than the protocol-allowed queue depth, there is
opportunity for improvement, to an asymptotic limit driven by servo
settle speed.

Obviously this performance improvement comes with the standard WB
risks, and YMMV, IANAL, etc.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Configuration questions for Home File Server (CPU cores, dedup, checksum)?

2010-09-07 Thread Eric D. Mudama

On Tue, Sep  7 at 17:13, Russ Price wrote:

On 09/07/2010 03:58 PM, Craig Stevenson wrote:

I am working on a home file server.  After reading a wide range of
blogs and forums, I have a few questions that are still not clear
to me

1.  Is there a benefit in having quad core CPU (e.g. Athlon II X4
vs X2)? All of the web blogs seem to suggest using lower-wattage
dual core CPUs.  But; with the recent advent of dedup, SHA256
checksum, etc., I am now wondering if opensolaris is better served
with quad core.


With a big RAIDZ3, it's well worth having extra cores. A scrub on my 
eight-disk RAIDZ2 uses about 60% of all four cores on my Athlon II X4 
630.


How are you measuring using 60% across all four cores?

I kicked off a scrub just to see, and we're scrubbing at 200MB/s (2
vdevs) and the CPU is 94% idle, 6% kernel, 0% IOWAIT.

zpool-tank is using 3.2% CPU as shown by 'ps aux | grep tank'

Am I missing something?


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] what is zfs doing during a log resilver?

2010-09-04 Thread Eric D. Mudama

On Sat, Sep  4 at  3:14, Giovanni Tirloni wrote:

  Good question. Here it takes a little over 1 hour to resilver a 32GB SSD in
  a mirror. I've always wondered what exactly it was doing since it was
  supposed to be 30 seconds worth of data. It also generates lots of
  checksum errors.


An hour?  Our boot drives (32GB X25-E) will resilver in about 1 minute.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-30 Thread Eric D. Mudama

On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:

I want to fix (as much as is possible) a misalignment issue with an
X-25E that I am using for both OS and as an slog device.

This is on x86 hardware running Solaris 10U8.

Partition table looks as follows:

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm       1 - 1306       10.00GB    (1306/0/0) 20980890
  1 unassigned    wu       0               0         (0/0/0)           0
  2     backup    wm       0 - 3886       29.78GB    (3887/0/0) 62444655
  3 unassigned    wu    1307 - 3886       19.76GB    (2580/0/0) 41447700
  4 unassigned    wu       0               0         (0/0/0)           0
  5 unassigned    wu       0               0         (0/0/0)           0
  6 unassigned    wu       0               0         (0/0/0)           0
  7 unassigned    wu       0               0         (0/0/0)           0
  8       boot    wu       0 -    0        7.84MB    (1/0/0)       16065
  9 unassigned    wu       0               0         (0/0/0)           0

And here is fdisk:

Total disk size is 3890 cylinders
Cylinder size is 16065 (512 byte) blocks

                                          Cylinders
     Partition   Status    Type          Start   End   Length    %
     =========   ======    ============  =====   ===   ======   ===
         1       Active    Solaris           1  3889    3889    100

Slice 0 is where the OS lives and slice 3 is our slog.  As you can see
from the fdisk partition table (and from the slice view), the OS
partition starts on cylinder 1 -- which is not 4k aligned.

I don't think there is much I can do to fix this without reinstalling.

However, I'm most concerned about the slog slice and would like to
recreate its partition such that it begins on cylinder 1312.

So a few questions:

   - Would making s3 be 4k block aligned help even though s0 is not?
   - Do I need to worry about 4k block aligning the *end* of the
 slice?  eg instead of ending s3 on cylinder 3886, end it on 3880
 instead?

Thanks,
Ray
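
The alignment arithmetic in the question can be checked with a short
sketch.  It uses only the geometry from the fdisk output above (16065
blocks of 512 bytes per cylinder); the function names and the
base_cylinder parameter are made up for illustration.  Whether slice
cylinders are counted from the start of the disk or from the start of the
fdisk partition is left as an input, since the thread does not settle it.

```python
SECTOR_BYTES = 512
SECTORS_PER_CYLINDER = 16065  # "Cylinder size" from the fdisk output
ALIGN_BYTES = 4096


def is_4k_aligned(cylinder: int, base_cylinder: int = 0) -> bool:
    """True if the given cylinder starts on a 4 KiB boundary.

    base_cylinder is added first, for the case where slice cylinders are
    counted from the start of the fdisk partition rather than the disk.
    """
    start_byte = (base_cylinder + cylinder) * SECTORS_PER_CYLINDER * SECTOR_BYTES
    return start_byte % ALIGN_BYTES == 0


def next_aligned_cylinder(cylinder: int, base_cylinder: int = 0) -> int:
    """First cylinder at or after the given one that is 4 KiB aligned."""
    while not is_4k_aligned(cylinder, base_cylinder):
        cylinder += 1
    return cylinder


for cyl in (1, 1307, 1312):
    print(cyl, is_4k_aligned(cyl))
print("next aligned at or after 1307:", next_aligned_cylinder(1307))
```

Because 16065 is odd, a cylinder boundary lands on a 4k boundary only when
the absolute cylinder number is a multiple of 8.  If the slice numbers turn
out to be relative to the fdisk partition (which starts at cylinder 1
here), passing base_cylinder=1 shifts the answer by one cylinder.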


Do you specifically have benchmark data indicating unaligned or
aligned+offset access on the X25-E is significantly worse than aligned
access?

I'd thought the tier1 SSDs didn't have problems with these workloads.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-30 Thread Eric D. Mudama

On Tue, Aug 31 at  6:12, Edho P Arief wrote:

On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:

In any case -- any thoughts on whether or not I'll be helping anything
if I change my slog slice starting cylinder to be 4k aligned even
though slice 0 isn't?



Some people claim that, due to how ZFS works, there will be a
performance hit as long as the reported sector size differs from the
physical sector size.

This thread[1] has the discussion on what happened and how to handle
such drives on freebsd.

[1] http://marc.info/?l=freebsd-fs&m=126976001214266&w=2


Yes, but that's for a 4k rotating drive, which has a much different
latency profile than an SSD.  I was wondering if anyone had a
benchmark showing this alignment mattered on the latest SSDs.  My
guess is no, but I have no data.



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Eric D. Mudama

On Fri, Aug 27 at  6:16, Eff Norwood wrote:


David asked me what I meant by filled up. If you make the unwise
decision to use an SSD as your ZIL, at some point days to weeks
after you install it, all of the pages will be allocated and you
will suddenly find the device to be slower than a conventional disk
drive. This is due to the way SSDs work. A great write up about how
this works is here:

http://www.anandtech.com/show/2738/8


While it's an interesting writeup, I think some assumptions are being
made that may not be quite correct.  In the case of a ZIL, with a
relatively small data set (< 1GB typically) on your SSD, the drive, if
designed correctly, will always be running with many gigabytes of
scratch area available.

Fully written SSDs may write more slowly than partially written SSDs
in some workloads, but I wouldn't expect a ZIL usage model to create
the scenario you linked due to the limited data set size.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS development moving behind closed doors

2010-08-22 Thread Eric D. Mudama

On Sat, Aug 21 at  4:13, Orvar Korvar wrote:

And by the way: Wasn't there a comment from Linus Torvalds recently that people should
move their low-quality code into the codebase ??? ;)

Anyone know the link? Good against the Linux fanboys. :o)


Can't find the original reference, but I believe he was arguing that
by moving code into the kernel and marking it as experimental, it's more
likely to be tested and have the bugs worked out, than if it forever
lives as patchsets.

Given the test environment, can't say I can argue against that point
of view.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Eric D. Mudama

On Mon, Aug 16 at  8:52, Ray Van Dolson wrote:

On Mon, Aug 16, 2010 at 08:48:31AM -0700, Joerg Schilling wrote:

Ray Van Dolson rvandol...@esri.com wrote:

  I absolutely guarantee Oracle can and likely already has
  dual-licensed BTRFS.

 Well, Oracle obviously would want btrfs to stay as part of the Linux
 kernel rather than die a death of anonymity outside of it...

 As such, they'll need to continue to comply with GPLv2 requirements.

No, there is definitely no need for Oracle to comply with the GPL as they
own the code.



Maybe there's not legally, but practically there is.  If they're not
GPL compliant, why would Linus or his lieutenants continue to allow the
code to remain part of the Linux kernel?


The snapshot of btrfs development would obviously remain GPL, that
can't be taken away from the kernel and anyone is free to continue
GPL development of that work.

However, Oracle can freely close up all future development and change
future licensing.  It obviously won't affect the previous
kernel-included snapshot, but depending on critical mass, may or may
not result in the bitrot of btrfs in linux.


And what purpose would btrfs serve Oracle outside of the Linux kernel?


Maybe allowing SANs built upon btrfs to be natively used within
Solaris/Oracle at some point in the future?  Adding btrfs-zfs
conversion utilities that do things like maintain snapshots, data set
properties, etc?

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Eric D. Mudama

On Mon, Aug 16 at 11:15, Tim Cook wrote:

  Or, for all you know, Chris Mason's contract has a non-compete that states
  if he leaves Oracle he's not allowed to work on any project he was a part
  of for five years.


IANAL, but as my discussions with employment lawyers in my state have
explained to me, a non-compete cannot legally prevent you from earning
a living.  If your one skill is in writing filesystems, you cannot be
prevented from doing so by a noncompete.

However, please get your own legal advice, as it varies significantly
state-to-state.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS development moving behind closed doors

2010-08-13 Thread Eric D. Mudama

On Fri, Aug 13 at 19:06, Frank Cusack wrote:

Interesting POV, and I agree.  Most of the many distributions of
OpenSolaris had very little value-add.  Nexenta was the most interesting
and why should Oracle enable them to build a business at their expense?


These distributions are, in theory, the gateway drug where people
can experiment inexpensively to try out new technologies (ZFS, dtrace,
crossbow, comstar, etc.) and eventually step up to Oracle's big iron
as their business grows.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-13 Thread Eric D. Mudama

On Fri, Aug 13 at 19:03, Frank Cusack wrote:

On 8/13/10 3:39 PM -0500 Tim Cook wrote:

Quite frankly, I think there will be an even faster decline of Solaris
installed base after this move.  I know I have no interest in pushing it
anywhere after this mess.


I haven't met anyone who uses Solaris because of OpenSolaris.


That's because the features that made opensolaris so attractive were
the bleeding-edge zfs versions and comstar, and I don't think either
had yet been backported to Solaris.

I'm sure Solaris uptake would increase over time, once those features
made it into the main OS.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-11 Thread Eric D. Mudama

On Tue, Aug 10 at 21:57, Peter Taps wrote:

Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk 
fails.


The data for any given sector striped across all drives can be thought
of as:

A+B+C = P

where A..C represent the contents of sector N on devices a..c, and P
is the parity located on device p.

From that, you can do some simple algebra to convert it to:

A+B+C-P = 0

If any of A,B,C or P are unreadable (assume B), from simple algebra,
you can solve for any single unknown (x) to recreate it:

A+x+C = P
A+x+C-A-C = P-A-C
x = P-A-C

and voila, you now have your original B contents, since B=x.
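
A concrete version of the algebra, as a minimal sketch rather than actual
ZFS code: for single parity, the "+" and "-" above are both bytewise XOR,
so adding and subtracting are the same operation.  The helper name and the
four-byte "sectors" are made up for illustration.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Contents of sector N on data devices a, b and c.
a = b"\x10\x20\x30\x40"
b = b"\x0f\x0e\x0d\x0c"
c = b"\xaa\xbb\xcc\xdd"

p = xor_blocks(a, b, c)            # A+B+C = P, stored on the parity device

# Device b fails: solve for the unknown with x = P-A-C.
recovered_b = xor_blocks(p, a, c)
print(recovered_b == b)
```

The same call rebuilds any one missing block, including P itself, from the
other three.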

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-10 Thread Eric D. Mudama

On Tue, Aug 10 at 15:40, Peter Taps wrote:

Hi,

First, I don't understand why parity takes so much space. From what
I know about parity, there is typically one parity bit per
byte. Therefore, the parity should be taking 1/8 of storage, not 1/3
of storage. What am I missing?


Think of it as 1 bit of parity per N-wide RAID'd bit stored on your
data drives, which is why it occupies 1/N size.

With 3 disks it's 1/3, with 8 disks it's 1/8, and with 10983 disks it
would be 1/10983, because you're generating parity across the width
of your stripe, not as a tail to each stored byte on individual
devices.
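
The same arithmetic as a tiny sketch, assuming an idealized full-width
stripe with one parity device (real raidz allocation can differ for small
blocks).  The function name is made up for illustration.

```python
from fractions import Fraction


def parity_fraction(total_disks: int, parity_disks: int = 1) -> Fraction:
    """Share of raw capacity spent on parity in a full-width stripe."""
    return Fraction(parity_disks, total_disks)


for disks in (3, 8, 10983):
    print(disks, "disks ->", parity_fraction(disks), "of raw space is parity")
```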

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Mirrored raidz

2010-07-26 Thread Eric D. Mudama

On Mon, Jul 26 at 11:51, Dav Banks wrote:

I wanted to test it as a backup solution. Maybe that's crazy in
itself but I want to try it.

Basically, once a week detach the 'backup' pool from the mirror,
replace the drives, add the new raidz to the mirror and let it
resilver and sit for a week.


Since you're already spending the disk drives for this that get
detached, it seems safer to me to just 'zfs send' to a minimal backup
system, and remove the extra drives from your primary server.  Less
overhead and the scrub can validate your backup copy at whatever
frequency you choose.

You don't even need the same pool layout on the backup machine.
Primary can be a stripe of mirrors, while your backup can be a wide
raidz2 setup.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] 1tb SATA drives

2010-07-19 Thread Eric D. Mudama

On Fri, Jul 16 at 18:32, Jordan McQuown wrote:

  I'm curious to know what other people are running for HD's in white box
  systems? I'm currently looking at Seagate Barracuda's and Hitachi
  Deskstars. I'm looking at the 1tb models. These will be attached to an LSI
  expander in a sc847e2 chassis driven by an LSI 9211-8i HBA. This system
  will be used as a large storage array for backups and archiving.


Dell shipped us WD RE3 drives in the server we bought from them,
they've been working great and come in a 1TB size.  Not sure about the
expander, but they talk just fine to the 9211 HBAs.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Legality and the future of zfs...

2010-07-19 Thread Eric D. Mudama

On Wed, Jul 14 at 23:51, Tim Cook wrote:

Out of the fortune 500, I'd be willing to bet there's exactly zero
companies that use whitebox systems, and for a reason.
--Tim


Sure, some core SAP system or HR data warehouse runs on name-brand
gear, and maybe they have massive SANs with various capabilities that
run on name brand gear as well, but I'd guess that most every fortune
500 company buys some large number of generic machines as well.

(generic being anything from newegg build-it-yourself to the bargain
SKUs from major PC companies that may not have mission-critical
support contracts associated with them)

Any company that believes it can add more value in their IT supply
chain than the vendor they'd be buying from would be foolish not to
put energy into that space (if they can afford to.)  Google is but a
single example, though I am sure there are others.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] 3 Data Disks: partitioning for 1 Raid0 and 1 Raidz1

2010-05-15 Thread Eric D. Mudama

On Sat, May 15 at  2:43, Jason Barr wrote:

Hello,

I want to slice these 3 disks into 2 partitions each and configure 1 Raid0 and 
1 Raidz1 on these 3.

What exactly has to be done? Using format and fdisk I know but not exactly how 
for this setup.

In case of disk failure: how do I replace the faulty one?


I think this is a bad idea.  Spreading multiple pools across the
partitions of the same set of drives will mean accesses to both pools
will have lots of extra seeks going from one portion of each drive to
the other.

If you are trying to get redundancy on a system without many disks,
I'd just mirror the root pool and put your data in there as well.  At
least that way, you won't be seeking across partitions.

Another option is to have a single boot drive, and a mirror of drives
for your data pool.  That's effectively how we do it at work, since
our SLA for system recovery allows a reinstall of the OS, and the
amount of custom configuration is minimal in our rpool.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Hard drives for ZFS NAS

2010-05-12 Thread Eric D. Mudama

On Wed, May 12 at  8:45, Freddie Cash wrote:

  On Wed, May 12, 2010 at 4:05 AM, Emily Grettel
  emilygrettelis...@hotmail.com wrote:

Hello,

I've decided to replace my WD10EADS and WD10EARS drives as I've checked
the SMART values and they've accrued some insanely high numbers for the
load/unload counts (40K+ in 120 days on one!).

I was leaning towards the Black drives but now I'm a bit worried about
the lack of TLER, which was a mistake made by my previous sysadmin.

I'm wondering what other people are using, even though the Green series
has let me down, I'm still a Western Digital gal.

Would you recommend any of these for use in a ZFS NAS?

  * 4x WD2003FYYS -
    http://www.wdc.com/en/products/Products.asp?DriveID=732 [RE4]
  * 4x WD2002FYPS -
    http://www.wdc.com/en/products/products.asp?DriveID=610 [Green]
  * 6x WD1002FBYS -
    http://www.wdc.com/en/products/Products.asp?DriveID=503 [RE3]



We use the WD1002FBYS (1.0TB WD RE3) and haven't had an issue yet in
our Dell T610 chassis.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS Hard disk buffer at 100%

2010-05-09 Thread Eric D. Mudama

On Sat, May  8 at 23:39, Ben Rockwood wrote:

The drive (c7t2d0) is bad and should be replaced.   The second drive
(c7t5d0) is either bad or going bad.  This is exactly the kind of
problem that can force a Thumper to its knees: ZFS performance is
horrific, and as soon as you drop the bad disks things magically return to
normal.


Problem is the OP is mixing client 4k drives with 512b drives.  They
may not actually be bad, but they appear to be getting misused in
this application.

I doubt they're broken per se, they're just dramatically slower
than their peers in this workload.

As a replacement recommendation, we've been beating on the WD 1TB RE3
drives for 18 months or so, and we're happy with both performance and
the price for what we get.  $160/ea with a 5 year warranty.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Performance drop during scrub?

2010-04-28 Thread Eric D. Mudama

On Wed, Apr 28 at  1:34, Tonmaus wrote:

Zfs scrub needs to access all written data on all disks and is usually
disk-seek or disk I/O bound so it is difficult to keep it from hogging
the disk resources.  A pool based on mirror devices will behave much
more nicely while being scrubbed than one based on RAIDz2.


Experience seconded entirely. I'd like to repeat that I think we
need more efficient load balancing functions in order to keep
housekeeping payload manageable. Detrimental side effects of scrub
should not be a decision point for choosing certain hardware or
redundancy concepts in my opinion.


While there may be some possible optimizations, I'm sure everyone
would love the random performance of mirror vdevs, combined with the
redundancy of raidz3 and the space of a raidz1.  However, as in all
systems, there are tradeoffs.

To scrub a long lived, full pool, you must read essentially every
sector on every component device, and if you're going to do it in the
order in which your transactions occurred, it'll wind up devolving to
random IO eventually.

You can choose to bias your workloads so that foreground IO takes
priority over scrub, but then you've got the cases where people
complain that their scrub takes too long.  There may be knobs for
individuals to use, but I don't think overall there's a magic answer.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?

2010-04-20 Thread Eric D. Mudama

On Tue, Apr 20 at 11:41, Don Turnbull wrote:
Not to be a conspiracy nut but anyone anywhere could have registered 
that gmail account and supplied that answer.  It would be a lot more 
believable from Mr Kay's Oracle or Sun account.


+1

Glad I wasn't the only one who noticed.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] recomend sata controller 4 Home server with zfs raidz2 and 8x1tb hd

2010-04-16 Thread Eric D. Mudama

On Thu, Apr 15 at 23:57, Günther wrote:

hello

if you are looking for pci-e (8x), i would recommend sas/sata  controller
with lsi 1068E sas chip. they are nearly perfect with opensolaris.


For just a bit more, you can get the LSI SAS 9211-8i card which is
6Gbit/s.  It works fine for us, and does JBOD no problem.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Making ZFS better: rm files/directories from snapshots

2010-04-16 Thread Eric D. Mudama

On Fri, Apr 16 at 13:56, Edward Ned Harvey wrote:

The typical problem scenario is:  Some user or users fill up the filesystem.
They rm some files, but disk space is not freed.  You need to destroy all
the snapshots that contain the deleted files, before disk space is available
again.

It would be nice if you could rm files from snapshots, without needing to
destroy the whole snapshot.

Is there any existing work or solution for this?


Doesn't that defeat the purpose of a snapshot?

If this is a real problem, I think that it calls for putting that
user's files in a separate filesystem that can have its snapshots
managed with a specific policy for addressing the usage model.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Secure delete?

2010-04-16 Thread Eric D. Mudama

On Fri, Apr 16 at 10:05, Bob Friesenhahn wrote:
It is much more efficient (from a housekeeping perspective) if 
filesystem sectors map directly to SSD pages, but we are not there 
yet.


How would you stripe or manage a dataset across a mix of devices with
different geometries?  That would break many of the assumptions made
by filesystems today.

I would argue it's easier to let the device virtualize this mapping
and present a consistent interface, regardless of the underlying
geometry.

As a devil's advocate, I am still waiting for someone to post a URL 
to a serious study which proves the long-term performance advantages 
of TRIM.


I am absolutely sure these studies exist, but as to some entity
publishing a long term analysis that cost real money (many thousands
of dollars) to create, I have no idea if data like that exists in the
public domain where anyone can see it.  I can virtually guarantee
every storage, SSD and OS vendor is generating that data internally
however.

--eric



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Secure delete?

2010-04-16 Thread Eric D. Mudama

On Fri, Apr 16 at 14:42, Miles Nordin wrote:

edm == Eric D Mudama edmud...@bounceswoosh.org writes:


  edm How would you stripe or manage a dataset across a mix of
  edm devices with different geometries?

the ``geometry'' discussed is 1-dimensional: sector size.

The way that you do it is to align all writes, and never write
anything smaller than the sector size.  The rule is very simple, and
you can also start or stop following it at any moment without
rewriting any of the dataset and still get the full benefit.


The response was regarding a filesystem with knowledge of the NAND
geometry, to align writes to exact page granularity.  My question was
how to implement that, if not all devices in a stripe set have the
same page size.

What you're suggesting is exactly what SSD vendors already do.  They
present a 512B standard host interface sector size, and perform their
own translations and management inside the device.


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Secure delete?

2010-04-15 Thread Eric D. Mudama

On Tue, Apr 13 at  9:52, Bob Friesenhahn wrote:

On Mon, 12 Apr 2010, Eric D. Mudama wrote:


The advantage of TRIM, even in high end SSDs, is that it allows you to
effectively have considerable extra space available to
the device for garbage collection and wear management when not all
sectors are in use on the device.

For most users, with anywhere from 5-15% of their device unused, this
difference is significant and can improve performance greatly in some
workloads.  Without TRIM, the device has no way to use this space for
anything but tracking the data that is no longer active.

Based on the above, I think TRIM has the potential to help every SSD,
not just the cheap SSDs.


It seems that the above was missing.  What concrete evidence were 
you citing?


Nothing concrete.  Just makes sense to me that if ZFS has to work
harder to garbage collect as a pool approaches 100% full, so would
SSDs that use variants of CoW have to work harder to garbage collect
as they approach 100% written.

The purpose of TRIM is to tell the drive that some # of sectors are no
longer important so that it doesn't have to work as hard in its
internal garbage collection.

The value should be clearly demonstrated as fact (with many months of 
prototype testing with various devices) before the feature becomes a 
pervasive part of the operating system.  Every article I have read 
about the value of TRIM is pure speculation.


Perhaps it will be found that TRIM has more value for SAN storage (to 
reclaim space for accounting purposes) than for SSDs.


Perhaps, but that's not my gut feel.  I believe it has real value for
users in enterprise type workloads where performance comes down to a
simple calculation of reserve area on the SSD.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?

2010-04-14 Thread Eric D. Mudama

On Wed, Apr 14 at 13:16, David Dyer-Bennet wrote:

I don't get random hangs in normal use; so I haven't done anything to get
past this.


Interesting.  Win7-64 clients were locking up our 2009.06 server
within seconds while performing common operations like searching and
copying large directory trees.

Luckily I could still roll back to 101b, which worked fine (except for
a CIFS bug because of its age), and my roll-forward to b130 was
successful as well.  We now have our primary on b130 and our slave
server on b134, with no stability issues in either one.


I DO get hangs when funny stuff goes on, which may well be related to that
problem (at least they require a reboot).  Hmmm; I get hangs sometimes
when trying to send a full replication stream to an external backup drive,
and I have to reboot to recover from them.  I can live with this, in the
short term.  But now I'm feeling hopeful that they're fixed in what I'm
likely to be upgrading to next.


Yes, hopefully.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Secure delete?

2010-04-12 Thread Eric D. Mudama

On Sun, Apr 11 at 22:45, James Van Artsdalen wrote:


PS. It is faster for an SSD to write a block of 0xFF than 0 and it's
possible some might make that optimization.  That's why I suggest
erase-to-ones rather than erase-to-zero.


Do you have any data to back this up?  While I understand the
underlying hardware implementation of NAND, I am not sure SSDs would
bother optimizing for this case.  A block erase would be just as
effective at hiding data.

I believe the reason strings of bits leak on rotating drives you've
overwritten (other than grown defects) is because of minute off-track
occurrences while writing (vibration, particles, etc.), causing
off-center writes that can be recovered in the future with the right
equipment.

Flash doesn't have this analog positioning problem.  While each
electron well is effectively analog, there's no best guess work at
locating the wells.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Secure delete?

2010-04-12 Thread Eric D. Mudama

On Mon, Apr 12 at 10:50, Bob Friesenhahn wrote:

On Mon, 12 Apr 2010, Tomas Ögren wrote:


For flash to overwrite a block, it needs to clear it first.. so yes,
clearing it out in the background (after erasing) instead of just before
the timing critical write(), you can make stuff go faster.


Yes of course.  Properly built SSDs include considerable extra space 
to support wear leveling, and this same space may be used to store 
erased blocks.  A block which is overwritten can simply be written 
to a block allocated from the extra free pool, and the existing block 
can be re-assigned to the free pool and scheduled for erasure.  This 
is a fairly simple recirculating algorithm which just happens to also 
assist with wear management.


The point originally made is that if you eventually write to every LBA
on a drive without TRIM support, your considerable extra space will
only include the extra physical blocks that the manufacturer provided
when they sold you the device, and for which you are paying.

The advantage of TRIM, even in high end SSDs, is that it allows you to
effectively have considerable additional space available to
the device for garbage collection and wear management when not all
sectors are in use on the device.

For most users, with anywhere from 5-15% of their device unused, this
difference is significant and can improve performance greatly in some
workloads.  Without TRIM, the device has no way to use this space for
anything but tracking the data that is no longer active.

Based on the above, I think TRIM has the potential to help every SSD,
not just the cheap SSDs.
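
As a rough illustration of the argument above, here is a toy model of how much space a drive has for garbage collection once every LBA has been written. The function name and the example numbers (160 GB of user capacity, 12 GB of factory reserve) are made up for the sketch; real drives are far more complicated.

```python
def effective_spare_gb(user_capacity_gb, factory_spare_gb, unused_fraction, trim):
    """Toy model: space usable for garbage collection after every LBA
    has been written at least once.

    Without TRIM the drive must treat all user LBAs as live data, so
    only the factory reserve counts. With TRIM, space the filesystem
    is not using can be treated as spare as well.
    """
    if not 0.0 <= unused_fraction <= 1.0:
        raise ValueError("unused_fraction must be between 0 and 1")
    reclaimed = user_capacity_gb * unused_fraction if trim else 0.0
    return factory_spare_gb + reclaimed


# Hypothetical drive: 160 GB user capacity, 12 GB reserve, 10% unused.
for trim in (False, True):
    spare = effective_spare_gb(160, 12, 0.10, trim)
    print(f"TRIM={trim}: {spare:.1f} GB available for garbage collection")
```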

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-09 Thread Eric D. Mudama

On Sat, Apr 10 at  7:22, Daniel Carosone wrote:

On Fri, Apr 09, 2010 at 10:21:08AM -0700, Eric Andersen wrote:

 If I could find a reasonable backup method that avoided external
 enclosures altogether, I would take that route.


I'm tending to like bare drives.

If you have the chassis space, there are 5-in-3 bays that don't need
extra drive carriers, they just slot a bare 3.5" drive.  For e.g.

http://www.newegg.com/Product/Product.aspx?Item=N82E16817994077


I have a few of the 3-in-2 versions of that same enclosure from the
same manufacturer, and they installed in about 2 minutes in my tower
case.

The 5-in-3 doesn't have grooves in the sides like their 3-in-2 does,
so some cases may not accept the 5-in-3 if your case has tabs to
support devices like DVD drives in the 5.25" slots.

The grooves are clearly visible in this picture:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817994075

The doors are a bit light perhaps, but it works just fine for my
needs and holds drives securely.  The small fans are a bit noisy, but
since the box lives in the basement I don't really care.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-07 Thread Eric D. Mudama

On Wed, Apr  7 at 12:41, Jason S wrote:

And just to clarify as far as expanding this pool in the future my
only option is to add another 7 spindle RaidZ2 array correct?


That is correct, unless you want to use the -f option to force-allow
an asymmetric expansion of your pool.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-06 Thread Eric D. Mudama

On Tue, Apr  6 at 13:03, Markus Kovero wrote:

Install nexenta on a dell poweredge ? 
or one of these http://www.pogolinux.com/products/storage_director

FYI: more recent PowerEdges (R410, R710, possibly blades too, those with 
integrated Broadcom chips) are not working very well with OpenSolaris due 
to Broadcom network issues (hang-ups, packet loss, etc.).
And as OpenSolaris is not a supported OS, Dell is not interested in fixing 
these issues.


Our Dell T610 is and has been working just fine for the last year and
a half, without a single network problem.  Do you know if they're
using the same integrated part?

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Are there (non-Sun/Oracle) vendors selling OpenSolaris/ZFS based NAS Hardware?

2010-04-06 Thread Eric D. Mudama

On Tue, Apr  6 at 17:56, Markus Kovero wrote:

Our Dell T610 is and has been working just fine for the last year and
a half, without a single network problem.  Do you know if they're
using the same integrated part?



--eric


Hi, as I should have mentioned, the integrated NICs that cause issues
use the Broadcom BCM5709 chipset, and these connectivity issues
have been quite widespread amongst Linux people too.  Red Hat is trying
to fix this (http://kbase.redhat.com/faq/docs/DOC-26837), but I believe
it's messed up in firmware somehow, as our tests show the
4.6.8-series firmware seems to be more stable.

As for workarounds, disabling MSI is bad if it creates
latency for network/disk controllers, and disabling C-states on
Nehalem processors is just stupid (no turbo, no power saving,
etc.).

Definitely no go for storage imo.


Seems like this issue only occurs when MSI-X interrupts are enabled
for the BCM5709 chips, or am I reading it wrong?

If I type 'echo ::interrupts | mdb -k', and isolate for
network-related bits, I get the following output:


 IRQ  Vect IPL Bus   Trg Type   CPU Share APIC/INT# ISR(s)
 36   0x60 6   PCI   Lvl Fixed  3   1 0x1/0x4   bnx_intr_1lvl
 48   0x61 6   PCI   Lvl Fixed  2   1 0x1/0x10  bnx_intr_1lvl


Does this imply that my system is not in a vulnerable configuration?
Supposedly I'm losing some performance without MSI-X, but I'm not sure
in which environments or workloads we would notice since the load on
this server is relatively low, and the L2ARC serves data at greater
than 100MB/s (wire speed) without stressing much of anything.
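
For anyone wanting to check several machines the same way, a small sketch that pulls the interrupt type for the bnx handlers out of text like the snippet above. It assumes whitespace-separated columns exactly as pasted here (real `::interrupts` output may be laid out differently on other builds), and it only reports the Type column; it says nothing about whether a given type is affected by the bug.

```python
SAMPLE = """\
 IRQ  Vect IPL Bus   Trg Type   CPU Share APIC/INT# ISR(s)
 36   0x60 6   PCI   Lvl Fixed  3   1 0x1/0x4   bnx_intr_1lvl
 48   0x61 6   PCI   Lvl Fixed  2   1 0x1/0x10  bnx_intr_1lvl
"""


def interrupt_types(text, isr_prefix="bnx"):
    """Return the set of interrupt types used by ISRs whose name
    starts with isr_prefix (hypothetical helper, column layout assumed)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    type_col = header.index("Type")
    isr_col = header.index("ISR(s)")
    found = set()
    for row in rows:
        if len(row) <= isr_col:
            continue  # skip rows that do not match the assumed layout
        if any(name.startswith(isr_prefix) for name in row[isr_col:]):
            found.add(row[type_col])
    return found


print(interrupt_types(SAMPLE))
```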

The BIOS settings in our T610 are exactly as they arrived from Dell
when we bought it over a year ago.

Thoughts?
--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Eric D. Mudama

On Fri, Apr  2 at 11:14, Tirso Alonso wrote:

If my new replacement SSD with identical part number and firmware is 0.001
Gb smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?


There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = (97696368) + (1953504 * (Desired Capacity in Gbytes - 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066


Problem is that it only applies to devices that are >= 50GB in size,
and the X25 in question is only 32GB.
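
For reference, the quoted formula is easy to evaluate. A minimal sketch, with the function name made up for the example and capacity given in whole decimal gigabytes:

```python
def idema_lba_count(capacity_gb):
    """LBA count per the IDEMA LBA1-02 formula quoted above.

    Only meaningful for capacities of 50 GB and up; smaller devices
    (such as a 32 GB drive) are outside the formula's range.
    """
    if capacity_gb < 50:
        raise ValueError("formula is only defined for capacities >= 50 GB")
    return 97696368 + 1953504 * (capacity_gb - 50)


for gb in (50, 500, 1000):
    print(gb, "GB ->", idema_lba_count(gb), "LBAs")
```

Calling it with 32 raises an error, which is the point made above: the standard gives no guidance for a drive that small.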

That being said, I'd be skeptical of either the sourcing of the parts,
or else some other configuration feature on the drives (like HPA or
DCO) that is changing the capacity.  It's possible one of these is in
effect.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)

2010-04-01 Thread Eric D. Mudama

On Thu, Apr  1 at 19:08, Ray Van Dolson wrote:

Well, haven't yet been able to try the firmware suggestion, but we did
replace the backplane.  No change.

I'm not sure the firmware change would do any good either.  As it is
now, as long as the SSD drives are attached directly to the LSI
controller (no intermediary backplane), everything works fine -- no
errors.

As soon as the backplane is put in the equation -- and *only* for SSD
devices used as ZIL, we begin seeing the timeout/retries.

Seems like if it were a 1068E firmware issue we'd be seeing the issue
whether or not the backplane is in place... but maybe I'm missing
something.


It's possible that the backplane leads to enough signal degradation
that the setup is now stressing error paths that simply aren't hit
with the direct-connect cabling.

This is the sort of issue that adapter (or device or expander)
firmware changes can mitigate or exacerbate.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Intel SASUC8I - worth every penny

2010-03-28 Thread Eric D. Mudama

On Sun, Mar 28 at 16:55, James Van Artsdalen wrote:

* SII3132-based PCIe X1 SATA card (2 ports)


This chip is slow.

PCIe cards based on the Silicon Image 3124 are much faster, peaking
around 1 GB/sec aggregate throughput.  However, the 3124 is a PCI-X
chip and hence is used behind an Intel PCI serial-to-parallel bridge
for PCIe applications: this makes for a more expensive card than a
3132.

All PCIe 3124 cards I have seen present all four 3124 ports as
external eSATA ports.  Perhaps someone else has seen a PCIe 3124
with internal SATA connectors?


The 3124 was one of the first NCQ-capable chips on the market, and
there are definitely internal versions of it around somewhere.

While they're typically mounted on PCI-X boards, the original
reference designs worked just fine in PCI slots.


As to the 3132, it's probably limited by the single bitlane.  I think
there's a 3134 variant that is PCI-e x4 which should be a lot faster.
Doesn't matter for rotating drives, but for SSDs it's important.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] RAIDZ2 configuration

2010-03-26 Thread Eric D. Mudama

On Fri, Mar 26 at  7:29, Edward Ned Harvey wrote:

   Using fewer than 4 disks in a raidz2 defeats the purpose of raidz2, as
   you will always be in a degraded mode.



  Freddie, are you nuts?  This is false.

  Sure you can use raidz2 with 3 disks in it.  But it does seem pointless to
  do that instead of a 3-way mirror.


One thing about mirrors is you can put each side of your mirror on a
different controller, so that any single controller failure doesn't
cause your pool to go down.

While controller failure rates are very low, using 16/24 or 14/21
drives for parity on a dataset seems crazy to me.  I know disks can be
unreliable, but they shouldn't be THAT unreliable.  I'd think that
spending fewer drives for hot redundancy and then spending some of
the balance on an isolated warm/cold backup solution would be more
cost effective.
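
The 16/24 and 14/21 figures fall out of simple arithmetic if the pool is built from 3-disk raidz2 vdevs (or 3-way mirrors), which is my reading of the layout being discussed. A small sketch, with the helper name invented for the example:

```python
def parity_drives(total_drives, vdev_width, parity_per_vdev):
    """Number of drives' worth of capacity spent on redundancy when
    total_drives are split into equal vdevs of vdev_width."""
    if total_drives % vdev_width:
        raise ValueError("total_drives must be a multiple of vdev_width")
    return (total_drives // vdev_width) * parity_per_vdev


# 3-disk raidz2 vdevs: two of every three drives hold parity.
print(parity_drives(24, 3, 2), "of 24")
print(parity_drives(21, 3, 2), "of 21")
# For comparison, 6-disk raidz2 vdevs across the same 24 drives.
print(parity_drives(24, 6, 2), "of 24")
```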

http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html

Quoting from the summary: "at some point, the system design will be
dominated by common failures and not the failure of independent
disks."

Another thought is that if heavy seeking is more likely to lead to
high temperature and/or drive failure, then reserving one or two slots
for an SSD L2ARC might be a good idea.  It'll take a lot of load off
of your spindles if your data set fits or mostly fits within the
L2ARC.  You'd need a lot of RAM to make use of a large L2ARC though,
just something to keep in mind.

We have a 32GB X25-E as L2ARC and though it's never more than ~5GB
full with our workloads, most every file access saturates the wire
(1.0 Gb/s ethernet) once the cache has warmed up, resulting in very
little IO to our spindles.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS hex dump diagrams?

2010-03-26 Thread Eric D. Mudama

On Fri, Mar 26 at 11:10, Sanjeev wrote:

On Thu, Mar 25, 2010 at 02:45:12PM -0700, John Bonomi wrote:

I'm sorry if this is not the appropriate place to ask, but I'm a
student and for an assignment I need to be able to show at the hex
level how files and their attributes are stored and referenced in
ZFS. Are there any resources available that will show me how this
is done?


You could try zdb.


Or just look at the source code.

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS backup configuration

2010-03-24 Thread Eric D. Mudama

On Wed, Mar 24 at 12:20, Wolfraider wrote:

Sorry if this has been discussed before. I tried searching but I
couldn't find any info about it. We would like to export our ZFS
configurations in case we need to import the pool onto another
box. We do not want to backup the actual data in the zfs pool, that
is already handled through another program.


I'm pretty sure the configuration is embedded in the pool itself.
Just import on the new machine.  You may need to force it with -f if
the pool wasn't exported on the old system properly.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Eric D. Mudama

On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:


  hdd ONLINE   0 0 0
c7t0d0p3  ONLINE   0 0 0

  rpool   ONLINE   0 0 0
c7t0d0s0  ONLINE   0 0 0


I trimmed your zpool status output a bit.

Are those two the same device?  I'm barely familiar with solaris
partitioning and labels... what's the difference between a slice and a
partition?


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-06 Thread Eric D. Mudama

On Sat, Mar  6 at 15:04, Richard Elling wrote:

On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:

On Sat, Mar  6 at  3:15, Abdullah Al-Dahlawi wrote:


 hdd ONLINE   0 0 0
   c7t0d0p3  ONLINE   0 0 0

 rpool   ONLINE   0 0 0
   c7t0d0s0  ONLINE   0 0 0


I trimmed your zpool status output a bit.

Are those two the same device?  I'm barely familiar with solaris
partitioning and labels... what's the difference between a slice and a
partition?


In this context, partition is an fdisk partition and slice is a
SMI or EFI labeled slice.  The SMI or EFI labeling tools (format,
prtvtoc, and fmthard) do not work on partitions.  So when you
choose to use ZFS on a partition, you have no tools other than
fdisk to manage the space. This can lead to confusion... a bad
thing.


So in that context, is the above 'zpool status' snippet a bad thing
to do?



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] What's the advantage of using multiple filesystems in a pool

2010-03-01 Thread Eric D. Mudama

On Sun, Feb 28 at 20:10, Erik Trimble wrote:
Obviously, having different filesystems gives you the ability to set 
different values for attributes, which may substantially improve 
performance or storage space depending on the data in that 
filesystem.  As an example above, I would consider turning 
compression on for your cloud/winbackups and possibly for cloud/data, 
but definitely not for either cloud/movies (assuming mpeg4 or similar 
files) or cloud/music.


On Win7, the automatic backup service saves chunked ~200MB .zip files.
These are unlikely to compress very well.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Large scale ZFS deployments out there (200 disks)

2010-02-26 Thread Eric D. Mudama

On Thu, Feb 25 at 20:21, Bob Friesenhahn wrote:

On Thu, 25 Feb 2010, Alastair Neil wrote:


I do not know and I don't think anyone would deploy a system in that way
with UFS.  This is the model that is imposed in order to take full
advantage of zfs advanced features such as snapshots, encryption and
compression and I know many universities in particular are eager to
adopt it for just that reason, but are stymied by this problem.


It was not really a serious question but it was posed to make a 
point. However, it would be interesting to know if there is another 
type of filesystem (even on Linux or some other OS) which is able to 
reasonably and efficiently support 16K mounted and exported file 
systems.


Eventually Solaris is likely to work much better for this than it 
does today, but most likely there are higher priorities at the 
moment.


I agree with the above, but the best practices guide:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_file_service_for_SMB_.28CIFS.29_or_SAMBA

states in the SAMBA section that "Beware that mounting 1000s of file
systems, will impact your boot time."  I'd say going from a 2-3 minute
boot time to a 4+ hour boot time is more than just "impact".  That's
getting hit by a train.

Might be useful for folks, if the above document listed a few concrete
datapoints of boot time scaling with the number of filesystems or
something similar.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] ZFS with hundreds of millions of files

2010-02-24 Thread Eric D. Mudama

On Wed, Feb 24 at 14:09, Bob Friesenhahn wrote:
400 million tiny files is quite a lot and I would hate to use 
anything but mirrors with so many tiny files.


And at 400 million, you're in the realm of needing mirrors of SSDs,
with their fast random reads.  Even at the 500+ IOPS of good SAS
drives, you're looking at a TON of spindles to move through 400
million 1KB files quickly.
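
To put a number on "a TON of spindles", a back-of-the-envelope sketch. It assumes one random IO per file, which is optimistic since metadata lookups add more; the function and parameter names are made up for the example.

```python
def scan_days(file_count, iops_per_spindle, spindles=1, ios_per_file=1):
    """Days needed to touch every file once, assuming purely random IO
    spread evenly across the spindles."""
    seconds = file_count * ios_per_file / (iops_per_spindle * spindles)
    return seconds / 86400


# 400 million files at 500 IOPS per drive.
for n in (1, 12, 24):
    print(f"{n:2d} spindle(s): {scan_days(400_000_000, 500, n):.2f} days")
```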

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-16 Thread Eric D. Mudama

On Tue, Feb 16 at  9:44, Brian E. Imhoff wrote:

But, at the end of the day, this is quite a bomb: A single raidz2
vdev has about as many IOs per second as a single disk, which could
really hurt iSCSI performance.

If I have to break 24 disks up in to multiple vdevs to get the
expected performance might be a deal breaker.  To keep raidz2
redundancy, I would have to lose..almost half of the available
storage to get reasonable IO speeds.


ZFS is quite flexible.  You can put multiple vdevs in a pool, and dial
your performance/redundancy just about wherever you want them.

24 disks could be:

12x mirrored vdevs (best random IO, 50% capacity, any 1 failure absorbed,
    up to 12 w/ limits)
6x 4-disk raidz vdevs (75% capacity, any 1 failure absorbed, up to 6 with
    limits)
4x 6-disk raidz vdevs (~83% capacity, any 1 failure absorbed, up to 4 with
    limits)
4x 6-disk raidz2 vdevs (~66% capacity, any 2 failures absorbed, up to 8 with
    limits)
1x 24-disk raidz2 vdev (~92% capacity, any 2 failures absorbed, worst random
    IO perf)
etc.
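
The capacity percentages above are just (disks per vdev - redundancy disks) / disks per vdev. A quick sketch for trying other layouts; it ignores metadata and padding overhead, so real usable space will be somewhat lower, and the names are invented for the example.

```python
def usable_fraction(disks_per_vdev, redundancy_disks):
    """Fraction of raw capacity left for data in one vdev."""
    if not 0 <= redundancy_disks < disks_per_vdev:
        raise ValueError("redundancy must be smaller than the vdev width")
    return (disks_per_vdev - redundancy_disks) / disks_per_vdev


layouts = [
    ("12x 2-way mirror", 2, 1),
    ("6x 4-disk raidz", 4, 1),
    ("4x 6-disk raidz", 6, 1),
    ("4x 6-disk raidz2", 6, 2),
    ("1x 24-disk raidz2", 24, 2),
]
for name, width, redundancy in layouts:
    print(f"{name}: {usable_fraction(width, redundancy):.1%} of raw capacity")
```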

I think the 4x 6-disk raidz2 vdev setup is quite commonly used with 24
disks available, but each application is different.  We use mirrored
vdevs at work, with a separate box as a live backup using raidz of
larger SATA drives.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-09 Thread Eric D. Mudama

On Tue, Feb  9 at  2:36, Kjetil Torgrim Homme wrote:

Daniel Carosone d...@geek.com.au writes:


In that context, I haven't seen an answer, just a conclusion:

 - All else is not equal, so I give my money to some other hardware
   manufacturer, and get frustrated that Sun won't let me buy the
   parts I could use effectively and comfortably.


no one is selling disk brackets without disks.  not Dell, not EMC, not
NetApp, not IBM, not HP, not Fujitsu, ...


http://discountechnology.com/Products/SCSI-Hard-Drive-Caddies-Trays



--
Eric D. Mudama
edmud...@mail.bounceswoosh.org



Re: [zfs-discuss] b131 OpenSol x86, 4 disk RAIDZ-1 scrub performance way down.

2010-02-01 Thread Eric D. Mudama

On Mon, Feb  1 at 16:12, Jake Carroll wrote:

Hi all.

Recent builds (b129, b130 and b131) have had me noticing some zpool
performance issues when scrubbing.

Running bare into some cheap SATA controllers, on a cheap mobo,
running 6GB of DDR2 + an Intel Q6600, with 4 * 1TB Samsung consumer
grade SATA drives, I've been accustomed to seeing around 150 to
180MB/sec scrubs on a single pool.

Until b129, 130 and 131 hit. I've got dedup=on, compression=on
(default, not gzip), no dedupe verify etc.


I think with dedupe, you've turned your scrub into a mostly random
operation.


I now see maybe 10MB/sec across 4 drives on scrub. Turning dedupe
off seemingly didn't help.


Disabling dedupe doesn't change the state of existing data.  Unless
you've disabled dedupe, then re-copied all your data, I believe your
existing data is all still in the dedupe state.


Are we seeing an issue of dedupe here, or something less complex
entirely?


sounds like dedupe to me... My non-dedupe zpools are scrubbing at the
same rate as ever in b130 on multiple servers.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org


