Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Sašo Kiselkov
On 01/07/2011 10:26 AM, Darren J Moffat wrote:
 On 06/01/2011 23:07, David Magda wrote:
 On Jan 6, 2011, at 15:57, Nicolas Williams wrote:

 Fletcher is faster than SHA-256, so I think that must be what you're
 asking about: can Fletcher+Verification be faster than
 Sha256+NoVerification?  Or do you have some other goal?

 Would running on recent T-series servers, which have on-die
 crypto units, help any in this regard?

 The on-chip SHA-256 implementation is not yet used; see:

 http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

 Note that the fix I integrated only uses a software implementation of
 SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU
 hardware implementation of SHA256.  The reason for this is to do with
 boot time availability of the Solaris Cryptographic Framework and the
 need to have ZFS as the root filesystem.

 Not yet changed; it turns out to be quite complicated to fix due to
 very early boot issues.

Would it be difficult to implement both methods and allow ZFS to switch
to the hardware-accelerated crypto backend at runtime after it has been
brought up and initialized? It seems like one heck of a feature
(essentially removing most of the computational complexity of dedup).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Sašo Kiselkov
On 01/07/2011 01:15 PM, Darren J Moffat wrote:
 On 07/01/2011 11:56, Sašo Kiselkov wrote:
 On 01/07/2011 10:26 AM, Darren J Moffat wrote:
 On 06/01/2011 23:07, David Magda wrote:
 On Jan 6, 2011, at 15:57, Nicolas Williams wrote:

 Fletcher is faster than SHA-256, so I think that must be what you're
 asking about: can Fletcher+Verification be faster than
 Sha256+NoVerification?  Or do you have some other goal?

 Would running on recent T-series servers, which have on-die
 crypto units, help any in this regard?

 The on-chip SHA-256 implementation is not yet used; see:

 http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

 Note that the fix I integrated only uses a software implementation of
 SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU
 hardware implementation of SHA256.  The reason for this is to do with
 boot time availability of the Solaris Cryptographic Framework and the
 need to have ZFS as the root filesystem.

 Not yet changed; it turns out to be quite complicated to fix due to
 very early boot issues.

 Would it be difficult to implement both methods and allow ZFS to switch
 to the hardware-accelerated crypto backend at runtime after it has been
 brought up and initialized? It seems like one heck of a feature

 Whether it is difficult or not depends on your level of familiarity
 with ZFS, boot and the cryptographic framework ;-)

 For me no it wouldn't be difficult but it still isn't completely trivial.

 (essentially removing most of the computational complexity of dedup).

 Most of the data I've seen on the performance impact of dedup is not
 coming from the SHA256 computation; it is mostly about the additional
 I/O to deal with the DDT. Though lowering the overhead that SHA256
 does add is always a good thing.

Well, seeing as all mainline ZFS development is now happening behind
closed doors, all I can really do is ask for features and hope Oracle
implements them :-). Nevertheless, thanks for the clarification.

BR,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 05:20 PM, Mark Sandrock wrote:
 
 On Apr 8, 2011, at 7:50 AM, Evaldas Auryla evaldas.aur...@edqm.eu wrote:
 
 On 04/ 8/11 01:14 PM, Ian Collins wrote:
 You have built-in storage failover with an AR cluster;
 and they do NFS, CIFS, iSCSI, HTTP and WebDav
 out of the box.

 And you have fairly unlimited options for application servers,
 once they are decoupled from the storage servers.

 It doesn't seem like much of a drawback -- although it
 may be for some smaller sites. I see AR clusters going in
 in local high schools and small universities.

 Which is all fine and dandy if you have a green field, or are willing to
 re-architect your systems.  We just wanted to add a couple more x4540s!


 Hi, same here, it's sad news that Oracle decided to stop the x4540
 production line. Before, ZFS geeks had a choice - buy the 7000 series if
 you want quick out-of-the-box storage with a nice GUI, or build your own
 storage with the x4540 line, which by the way has brilliant engineering
 design. That choice is gone now.
 
 Okay, so what is the great advantage
 of an X4540 versus X86 server plus
 disk array(s)?
 
 Mark

Several:

 1) Density: The X4540 has far greater density than a 1U server + Sun's
J4200 or J4400 storage arrays. The X4540 did 12 disks / 1RU, whereas a
1U + 2xJ4400 only manages ~5.3 disks / 1RU.

 2) Number of components involved: server + disk enclosure means you
have more PSUs which can die on you, more cabling to accidentally
disconnect and generally more hassle with installation.

 3) Spare management: With the X4540 you only have to have one kind of
spare component: the server. With servers + enclosures, you might need
to keep multiple.

I agree that besides 1), both 2) and 3) are relatively trivial problems
to solve. Of course, server + enclosure builds do have their place, such
as when you might need to scale, but even then you could just hook them
up to an X4540 (or purchase a new one - I never quite understood why the
storage-enclosure-only variant of the X4540 case was more expensive than
an identical server).

In short, I think the X4540 was an elegant and powerful system that
definitely had its market, especially in my area of work (digital video
processing - heavy on latency, throughput and IOPS - an area where the
7000-series with its over-the-network access would just be a totally
useless brick).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 06:59 PM, Darren J Moffat wrote:
 On 08/04/2011 17:47, Sašo Kiselkov wrote:
 In short, I think the X4540 was an elegant and powerful system that
 definitely had its market, especially in my area of work (digital video
 processing - heavy on latency, throughput and IOPS - an area, where the
 7000-series with its over-the-network access would just be a totally
 useless brick).
 
 As an engineer I'm curious have you actually tried a suitably sized
 S7000 or are you assuming it won't perform suitably for you ?
 

No, I haven't tried an S7000, but I've tried other kinds of network
storage, and from a design perspective, for my applications, it doesn't
even make a single bit of sense. I'm talking about high-volume real-time
video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from
a machine over UDP. Having to go over the network to fetch the data from
a different machine is kind of like building a proxy which doesn't
really do anything - if the data is available from a different machine
over the network, then why the heck should I just put another machine in
the processing path? For my applications, I need a machine with as few
processing components between the disks and network as possible, to
maximize throughput, maximize IOPS and minimize latency and jitter.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Network video streaming [Was: Re: X4540 no next-gen product?]

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 07:22 PM, J.P. King wrote:
 
 No, I haven't tried a S7000, but I've tried other kinds of network
 storage and from a design perspective, for my applications, it doesn't
 even make a single bit of sense. I'm talking about high-volume real-time
 video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from
 a machine over UDP. Having to go over the network to fetch the data from
 a different machine is kind of like building a proxy which doesn't
 really do anything - if the data is available from a different machine
 over the network, then why the heck should I just put another machine in
 the processing path? For my applications, I need a machine with as few
 processing components between the disks and network as possible, to
 maximize throughput, maximize IOPS and minimize latency and jitter.
 
 I can't speak for this particular situation or solution, but I think in
 principle you are wrong.  Networks are fast.  Hard drives are slow.  Put
 a 10G connection between your storage and your front ends and you'll
 have the bandwidth[1].  Actually if you really were hitting 1000x8Mbits
 I'd put 2, but that is just a question of scale.  In a different
 situation I have boxes which peak at around 7 Gb/s down a 10G link (in
 reality I don't need that much because it is all about the IOPS for
 me).  That is with just twelve 15k disks.  Your situation appears to be
 pretty ideal for storage hardware, so perfectly achievable from an
 appliance.

I envision this kind of scenario (using my fancy ASCII art skills :-)):

|<======= streaming server =======>|
+-------+ SAS  +-----+ PCI-e +-----+ Ethernet +--------+
| DISKS | ===> | RAM | ====> | NIC | ========> | client |
+-------+      +-----+       +-----+           +--------+

And you are advocating for this kind of scenario:

|<======== network storage =======>|
+-------+ SAS  +-----+ PCI-e +-----+ Ethernet
| DISKS | ===> | RAM | ====> | NIC | ========> ...
+-------+      +-----+       +-----+

     |<========= streaming server =========>|
     +-----+ PCI-e +-----+ PCI-e +-----+ Ethernet +--------+
... >| NIC | ====> | RAM | ====> | NIC | ========> | client |
     +-----+       +-----+       +-----+           +--------+

I'm not constrained on CPU (so hooking up multiple streaming servers to
one backend storage doesn't really make sense). So what exactly does this
scenario add for my needs, besides requiring extra hardware in both the
storage and the server (10G NICs, cabling, modules, etc.)? I'm not saying
no - I'd love to improve the throughput, IOPS and latency characteristics
of my systems.

 I can't speak for the S7000 range.  I ignored that entire product line
 because when I asked about it the markup was insane compared to just
 buying X4500/X4540s.  The price for Oracle kit isn't remotely tenable, so
 the death of the X45xx range is a moot point for me anyway, since I
 couldn't afford it.
 
 [1] Just in case, you also shouldn't be adding any particularly
 significant latency either.  Jitter, maybe, depending on the specifics
 of the streams involved.
 
 Saso
 
 Julian
 -- 
 Julian King
 Computer Officer, University of Cambridge, Unix Support
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-09 Thread Sašo Kiselkov
On 04/09/2011 01:41 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Julian King

 Actually I think our figures more or less agree. 12 disks = 7 mbits
 48 disks = 4x7mbits
 
 I know that sounds like terrible performance to me.  Any time I benchmark
 disks, a cheap generic SATA can easily sustain 500Mbit, and any decent drive
 can easily sustain 1Gbit.

I think he mistyped and meant 7gbit/s.

 Of course it's lower when there's significant random seeking happening...
 But if you have a data model which is able to stream sequentially, the above
 is certainly true.

Unfortunately, this is exactly my scenario, where I want to stream large
volumes of data in many concurrent threads over large datasets which
have no hope of fitting in RAM or L2ARC and with generally very little
locality.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Monitoring disk seeks

2011-05-19 Thread Sašo Kiselkov
Hi all,

I'd like to ask whether there is a way to monitor disk seeks. I have an
application where many concurrent readers (50) sequentially read a
large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
read/write ops using iostat, but that doesn't tell me how contiguous the
data is, i.e. when iostat reports 500 read ops, does that translate to
500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-19 Thread Sašo Kiselkov
On 05/19/2011 03:35 PM, Tomas Ögren wrote:
 On 19 May, 2011 - Sašo Kiselkov sent me these 0,6K bytes:
 
 Hi all,

 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
 
 Get DTraceToolkit and check out the various things under Disk and FS,
 might help.
 
 /Tomas

Thank you all for the tips, I'll try to poke around using the DTraceToolkit.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-20 Thread Sašo Kiselkov
On 05/19/2011 07:47 PM, Richard Elling wrote:
 On May 19, 2011, at 5:35 AM, Sašo Kiselkov wrote:
 
 Hi all,

 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
 
 In general, this is hard to see from the OS.  In Solaris, the default I/O
 flowing through sd gets sorted based on LBA before being sent to the
 disk. If the disk gets more than 1 concurrent I/O request (10 is the default
 for Solaris-based ZFS) then the disk can re-sort or otherwise try to optimize
 the media accesses.
 
 As others have mentioned, iopattern is useful for looking at sequential 
 patterns. I've made some adjustments for the version at
 http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
 
 You can see low-level SCSI activity using scsi.d, but I usually uplevel that
 to using iosnoop -Dast which shows each I/O and its response time.
 Note that the I/Os can complete out-of-order on many devices. The only 
 device I know that is so fast and elegant that it always completes in-order 
 is the DDRdrive.
 
 For detailed analysis of iosnoop data, you will appreciate a real statistics
 package. I use JMP, but others have good luck with R.
  -- richard

Thank you, the iopattern script seems to be quite close to what I
wanted. The percentage split between random and sequential I/O is pretty
much what I needed to know.
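
For the archives, this is roughly how I'm running it (a sketch from memory:
the interval argument and the %RAN/%SEQ columns are how I recall the
DTraceToolkit iopattern behaving, so treat the exact output format as an
assumption):

# sample all disk I/O system-wide in 10-second intervals
./iopattern 10
# the %RAN and %SEQ columns give the random/sequential split per interval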

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-24 Thread Sašo Kiselkov
On 05/24/2011 03:08 PM, a.sm...@ukgrid.net wrote:
 Hi,
 
   see the seeksize script on this URL:
 
 http://prefetch.net/articles/solaris.dtracetopten.html
 
 Not used it but looks neat!
 
 cheers Andy.

I already did and it does the job just fine. Thank you for your kind
suggestion.
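
For anyone digging this out of the archives later, the gist of what the
seeksize script measures can be approximated with a one-liner along these
lines (a rough sketch using the stock DTrace io provider, not the actual
DTraceToolkit code; the block arithmetic and per-device aggregation are my
own simplification):

# distance (in 512-byte blocks) between the end of the previous I/O and
# the start of the next one, aggregated per device as a power-of-two histogram
dtrace -n '
io:::start
{
    this->dev = args[1]->dev_statname;
    this->delta = (int64_t)args[0]->b_blkno - (int64_t)last[this->dev];
    @dist[this->dev] = quantize(this->delta < 0 ? -this->delta : this->delta);
    last[this->dev] = args[0]->b_blkno + args[0]->b_bcount / 512;
}'

A mostly sequential workload keeps the histogram bunched near zero; lots of
large deltas mean the disks really are seeking.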

BR,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Fixing txg commit frequency

2011-06-24 Thread Sašo Kiselkov
Hi All,

I'd like to ask whether there is a method to enforce a certain txg
commit frequency on ZFS. I'm doing a large amount of video streaming
from a storage pool while also slowly and continuously writing a constant
volume of data to it (using a normal file descriptor, *not* in O_SYNC).
When the read volume goes over a certain threshold (and the average pool
load over ~50%), ZFS thinks it's running out of steam on the storage pool
and starts committing transactions more often, which results in even
greater load on the pool. This leads to a sudden spike in I/O utilization
on the pool, roughly as follows:

 # streaming clients    pool load [%]
          15                  8%
          20                 11%
          40                 22%
          60                 33%
          80                 44%
 --- around here txg timeouts start to shorten ---
          85                 60%
          90                 70%
          95                 85%

My application does a fair bit of caching and prefetching, so I have
zfetch disabled and primarycache set to metadata only. Also, reads
happen (on a per-client basis) relatively infrequently, so I can easily
take it if the pool stops reading for a few seconds and just writes
data. The problem is, ZFS starts alternating between reads and writes
really quickly, which in turn starves me of IOPS and results in a huge
load spike. Judging by the load numbers up to around 80 concurrent
clients, I suspect I could go up to 150 concurrent clients on this pool,
but because of this spike I top out at around 95-100 concurrent clients.
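
In case anyone wants to reproduce the observation, one way to watch the
commit cadence is roughly the following (a quick sketch; it assumes the fbt
provider can probe spa_sync on this build, which I haven't verified beyond
oi_148):

# print a line for every txg sync, with wall-clock time and sync duration
dtrace -qn '
fbt::spa_sync:entry  { self->ts = timestamp; }
fbt::spa_sync:return /self->ts/
{
    printf("%Y  txg sync took %d ms\n", walltimestamp,
        (timestamp - self->ts) / 1000000);
    self->ts = 0;
}'

When the spike hits, the interval between these lines visibly shortens.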

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-26 Thread Sašo Kiselkov
On 06/26/2011 06:17 PM, Richard Elling wrote:
 
 On Jun 24, 2011, at 5:29 AM, Sašo Kiselkov wrote:
 
 Hi All,

 I'd like to ask about whether there is a method to enforce a certain txg
 commit frequency on ZFS. I'm doing a large amount of video streaming
 from a storage pool while also slowly continuously writing a constant
 volume of data to it (using a normal file descriptor, *not* in O_SYNC).
 When reading volume goes over a certain threshold (and average pool load
 over ~50%), ZFS thinks it's running out of steam on the storage pool and
 starts committing transactions more often which results in even greater
 load on the pool. This leads to a sudden spike in I/O utilization on the
 pool in roughly the following method:

 # streaming clients    pool load [%]
          15                  8%
          20                 11%
          40                 22%
          60                 33%
          80                 44%
 --- around here txg timeouts start to shorten ---
          85                 60%
          90                 70%
          95                 85%
 
 What is a pool load? We expect 100% utilization during the txg commit,
 anything else is a waste.
 
 I suspect that you actually want more, smaller commits to spread the load
 more evenly. This is easy to change, but unless you can tell us what OS
 you are running, including version, we don't have a foundation to build upon.
  -- richard

Pool load is a 60-second average of the aggregated util percentages
reported by iostat -D for the disks which comprise the pool (so I run
iostat -Dn {pool-disks} 60 and compute the load for each row printed
as the average of the util columns). Interestingly enough, when
watching 1-second updates in iostat I never see util hit 100% during a
txg commit, even if it takes two or more seconds to complete. This tells
me that the disks still have enough performance headroom that ZFS
doesn't really need to shorten the interval at which commits occur.

I'm running oi_148, and all pools are zfs version 28.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-29 Thread Sašo Kiselkov
On 06/29/2011 02:33 PM, Sašo Kiselkov wrote:
 Also there is a buffer-size limit, like this (384MB):
 set zfs:zfs_write_limit_override = 0x18000000

 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw
 
 Currently my value for this is 0. How should I set it? I'm writing
 ~15MB/s and would like txg flushes to occur at most once every 10
 seconds. Should I set it to 150MB then?
 
 We had similar spikes with big writes to a Thumper with SXCE in the pre-90's
 builds, when the system would stall for seconds while flushing a 30-second TXG
 full of data. Adding a reasonable megabyte limit solved the unresponsiveness
 problem for us, by making these flush-writes rather small and quick.
 
 I need to do the opposite - I don't need to shorten the interval of
 writes, I need to increase it. Can I do that using zfs_write_limit_override?

Just as a followup, I've had a look at the tunables in dsl_pool.c and
found that I could potentially influence the write pressure calculation
by tuning zfs_txg_synctime_ms - do you think increasing this value from
its default (1000ms) would help me lower the write scheduling frequency?
(I don't mind if a txg write takes even twice as long; my application
buffers are on average 6 seconds long.)
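
(For reference, the way I'd poke at it - assuming the oi_148 symbol really is
zfs_txg_synctime_ms and that the 32-bit /W write is appropriate for it:)

# read the current value as a 32-bit decimal
echo zfs_txg_synctime_ms/D | mdb -k

# bump it to 10000 ms at runtime
echo zfs_txg_synctime_ms/W0t10000 | mdb -kw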

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-29 Thread Sašo Kiselkov
On 06/27/2011 11:59 AM, Jim Klimov wrote:
 
  I'd like to ask whether there is a method to enforce a certain txg
  commit frequency on ZFS.
 
  Well, there is a timer frequency based on TXG age (i.e. 5 sec
  by default now), in /etc/system like this:
  
 set zfs:zfs_txg_synctime = 5

When trying to read the value through mdb I get:

# echo zfs_txg_synctime::print | mdb -k
mdb: failed to dereference symbol: unknown symbol name

Is this some new addition in S11E?

 Also there is a buffer-size limit, like this (384MB):
 set zfs:zfs_write_limit_override = 0x18000000
 
 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw

Currently my value for this is 0. How should I set it? I'm writing
~15MB/s and would like txg flushes to occur at most once every 10
seconds. Should I set it to 150MB then?
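
(If I do try that: 150 MB comes out to 157286400 bytes, so presumably, reusing
the syntax quoted above,

# echo zfs_write_limit_override/W0t157286400 | mdb -kw

though since zfs_write_limit_override looks like a 64-bit tunable in
dsl_pool.c, /Z0t157286400 would be the strictly correct write size.)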

 We had similar spikes with big writes to a Thumper with SXCE in the pre-90's
 builds, when the system would stall for seconds while flushing a 30-second TXG
 full of data. Adding a reasonable megabyte limit solved the unresponsiveness
 problem for us, by making these flush-writes rather small and quick.

I need to do the opposite - I don't need to shorten the interval of
writes, I need to increase it. Can I do that using zfs_write_limit_override?

Thanks.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 01:10 PM, Jim Klimov wrote:
 2011-06-30 11:47, Sašo Kiselkov wrote:
 On 06/30/2011 02:49 AM, Jim Klimov wrote:
 2011-06-30 2:21, Sašo Kiselkov wrote:
 On 06/29/2011 02:33 PM, Sašo Kiselkov wrote:
 Also there is a buffer-size limit, like this (384MB):
 set zfs:zfs_write_limit_override = 0x18000000

 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw
 Currently my value for this is 0. How should I set it? I'm writing
 ~15MB/s and would like txg flushes to occur at most once every 10
 seconds. Should I set it to 150MB then?

 We had similar spikes with big writes to a Thumper with SXCE in the
 pre-90's
 builds, when the system would stall for seconds while flushing a
 30-second TXG
 full of data. Adding a reasonable megabyte limit solved the
 unresponsiveness
 problem for us, by making these flush-writes rather small and quick.
 I need to do the opposite - I don't need to shorten the interval of
 writes, I need to increase it. Can I do that using
 zfs_write_limit_override?
 Just as a followup, I've had a look at the tunables in dsl_pool.c and
 found that I could potentially influence the write pressure calculation
 by tuning zfs_txg_synctime_ms - do you think increasing this value from
 its default (1000ms) would help me lower the write scheduling frequency? (I
 don't mind if a txg write takes even twice as long; my application
 buffers are on average 6 seconds long.)

 Regards,
 -- 
 Saso
 It might help. In my limited testing on oi_148a,
 it seems that zfs_txg_synctime_ms and zfs_txg_timeout
 are linked somehow (i.e. changing one value changed the
 other accordingly). So in effect they may be two names
 for the same tunable (one in single units of secs, another
 in thousands of msecs).
 Well, to my understanding, zfs_txg_timeout is the timer limit on
 flushing pending txgs to disk - if the timer fires the current txg is
 written to disk regardless of its size. Otherwise the txg scheduling
 algorithm should take into account I/O pressure on the pool, estimate
 the remaining write bandwidth and fire when it estimates that a txg
 commit would overflow zfs_txg_synctime[_ms]. I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more knowledge
 of these internals please chime in?
 
 
 Somewhere in our discussion the Reply-to-all was lost.
 Back to the list :)
 
 Saso: Did you try setting both the timeout limit and the
 megabyte limit values, and did you see system IO patterns
 correlate with these values?
 
 My understanding was like yours above, so if things are
 different in reality - I'm interested to know too.
 
 PS: I don't think you wrote: which OS version do you use?

Thanks for the suggestions, I'll try them out. I'm running oi_148.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 01:33 PM, Jim Klimov wrote:
 2011-06-30 15:22, Sašo Kiselkov wrote:
 I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more
 knowledge
 of these internals please chime in?

 
 And about this part - it was my understanding and experience
 (from SXCE) that these values can be set at run-time and are
 used as soon as set (or maybe in a few TXGs - but visibly in
 real-time).
 
 Also I've seen instant results from setting the TXG sync times
 on oi_148a with light loads (in my thread about trying to
 account for some 2Mb writes to my root pool) - this could be
 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on the
 currently set TXG timeout value.
 

Hm, it appears I'll have to do some reboots and more extensive testing.
I tried tuning various settings and then returned everything back to the
defaults. Yet, now I can ramp the number of concurrent output streams to
~170 instead of the original 95 (even then the pool still has capacity
left, I'm actually running out of CPU power). The txg commit occurs at
roughly every 15 (or so) seconds, which is what I wanted. Strange that
this occurs even after I returned everything to the defaults... I'll try
doing some more testing on this once I move the production deployment to
a different system and I'll have more time to experiment with this
machine. Anyways, thanks for the suggestions, it helped a lot.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 11:56 PM, Sašo Kiselkov wrote:
 On 06/30/2011 01:33 PM, Jim Klimov wrote:
 2011-06-30 15:22, Sašo Kiselkov wrote:
 I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more
 knowledge
 of these internals please chime in?


 And about this part - it was my understanding and experience
 (from SXCE) that these values can be set at run-time and are
 used as soon as set (or maybe in a few TXGs - but visibly in
 real-time).

 Also I've seen instant result from setting the TXG sync times
 on oi_148a with little loads (in my thread about trying to
 account for some 2Mb writes to my root pool) - this could be
 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on TXG
 timeout currently set value.

 
 Hm, it appears I'll have to do some reboots and more extensive testing.
 I tried tuning various settings and then returned everything back to the
 defaults. Yet, now I can ramp the number of concurrent output streams to
 ~170 instead of the original 95 (even then the pool still has capacity
 left, I'm actually running out of CPU power). The txg commit occurs at
 roughly every 15 (or so) seconds, which is what I wanted. Strange that
 this occurs even after I returned everything to the defaults... I'll try
 doing some more testing on this once I move the production deployment to
 a different system and I'll have more time to experiment with this
 machine. Anyways, thanks for the suggestions, it helped a lot.
 
 Regards,
 --
 Saso

Just a follow-up correction: one parameter was indeed changed:
zfs_write_limit_inflated. In the source it's set to zero, I've set it to
0x2.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HP JBOD D2700 - ok?

2011-11-30 Thread Sašo Kiselkov
On 11/30/2011 02:40 PM, Edmund White wrote:
 Absolutely. 
 
 I'm using a fully-populated D2700 with an HP ProLiant DL380 G7 server
 running NexentaStor.
 
 On the HBA side, I used the LSI 9211-8i 6G controllers for the server's
 internal disks (boot, a handful of large disks, Pliant SSDs for L2Arc).
 There is also a DDRDrive for ZIL. To connect to the D2700 enclosure, I
 used 2 x LSI 9205 6G HBAs; one 4-lane SAS cable per storage controller on
 the D2700.
 
 These were setup with MPxIO (dual controllers, dual paths, dual-ported
 disks) and required a slight bit of tuning of /kernel/drv/scsi_vhci.conf,
 but the performance is great now. The enclosure is supported and I've been
 able to set up drive slot maps and control disk LEDs, etc.
 

Coincidentally, I'm also thinking about getting a few D2600 enclosures,
but I've been considering attaching them via a pair of HP SC08Ge 6G SAS
HBAs. Has anybody had any experience with these HBAs? According to a few
searches on the Internet, it should be a rebranded LSI 9200-8e.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2012-01-06 Thread Sašo Kiselkov
On 07/01/2011 12:01 AM, Sašo Kiselkov wrote:
 On 06/30/2011 11:56 PM, Sašo Kiselkov wrote:
 Hm, it appears I'll have to do some reboots and more extensive testing.
 I tried tuning various settings and then returned everything back to the
 defaults. Yet, now I can ramp the number of concurrent output streams to
 ~170 instead of the original 95 (even then the pool still has capacity
 left, I'm actually running out of CPU power). The txg commit occurs at
 roughly every 15 (or so) seconds, which is what I wanted. Strange that
 this occurs even after I returned everything to the defaults... I'll try
 doing some more testing on this once I move the production deployment to
 a different system and I'll have more time to experiment with this
 machine. Anyways, thanks for the suggestions, it helped a lot.

 Regards,
 --
 Saso
 
 Just a follow-up correction: one parameter was indeed changed:
 zfs_write_limit_inflated. In the source it's set to zero, I've set it to
 0x2.

So it seems I was wrong after all and it didn't help. So the question
remains: is there a way to force ZFS *NOT* to commit a txg before a
certain minimum amount of data has accumulated in it, or before the txg
timeout is reached?

All the best,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Windows 8 ReFS (OT)

2012-01-17 Thread Sašo Kiselkov
On 01/17/2012 01:06 AM, David Magda wrote:
 Kind of off topic, but I figured of some interest to the list. There will be 
 a new file system in Windows 8 with some features that we all know and love 
 in ZFS:
 
 As mentioned previously, one of our design goals was to detect and correct 
 corruption. This not only ensures data integrity, but also improves system 
 availability and online operation. Thus, all ReFS metadata is check-summed 
 at the level of a B+ tree page, and the checksum is stored independently 
 from the page itself. [...] Once ReFS detects such a failure, it interfaces 
 with Storage Spaces to read all available copies of data and chooses the 
 correct one based on checksum validation. It then tells Storage Spaces to 
 fix the bad copies based on the good copies. All of this happens 
 transparently from the point of view of the application.

Looks like what the Btrfs people were trying to do.

--
S
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
Hi,

I'm getting weird errors while trying to install OpenIndiana 151a on a
Dell R715 with a PERC H200 (based on an LSI SAS 2008). Any time the OS
tries to access the drives (for whatever reason), I get this dumped into
syslog:

genunix: WARNING: Device
/pci@0,0/pci1002,5a18@4/pci10b58424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@40/disk@w5c0f01004ebe,0
failed to power up
genunix: WARNING: Device
/pci@0,0/pci1002,5a18@4/pci10b58424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@80/disk@w5c0f01064e9e,0
failed to power up

(these are two WD 300GB 10k SAS drives)

When this log message shows up, I can see each drive light up the drive
LED briefly and then it turns off, so apparently the OS tried to
initialize the drives, but somehow failed and gave up.

Consequently, when I try to access them in format(1), they show up as
an unknown type, and installing OpenIndiana on them fails while the
installer is trying to run fdisk.

Has anybody got any idea what I can do to the controller/drives/whatever
to fix the "failed to power up" problem? One would think that an LSI SAS
2008 chip would be problem-free under Solaris (the server even lists
Oracle Solaris as an officially supported OS), but alas, I have yet to
succeed.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote:
 Hi,
 
 are those DELL branded WD disks? DELL tends to manipulate the
 firmware of the drives so that power handling with Solaris fails.
 If this is the case here:
 
 Easiest way to make it work is to modify /kernel/drv/sd.conf and
 add an entry for your specific drive similar to this
 
 sd-config-list= "WD      WD2000FYYG", "power-condition:false",
 "SEAGATE ST2000NM0001", "power-condition:false",
 "SEAGATE ST32000644NS", "power-condition:false",
 "SEAGATE ST91000640SS", "power-condition:false";
 
 Naturally you would have to find out the correct drive names. My
 latest version for a R710 with a MD1200 attached is:
 
 sd-config-list= "SEAGATE ST2000NM0001", "power-condition:false",
 "SEAGATE ST1000NM0001", "power-condition:false",
 "SEAGATE ST91000640SS", "power-condition:false";
 
 
 Are you using the H200 with the base firmware or did you flash it
 to LSI IT? I am not sure that Solaris handles the H200 natively at
 all, and if it does, it will not have direct drive access since the H200
 will only show virtual drives to Solaris/OI will it not?

They are Dell-branded WD disks and I haven't done anything to the
HBA's firmware, so that's stock Dell as well. The drives, specifically,
are WD3001BKHG models. The firmware actually does expose the disks
unless they're part of a RAID group, so that should actually work. I'll
try the power-condition workaround you mentioned.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote:
 Hi,
 
 are those DELL branded WD disks? DELL tends to manipulate the firmware of
 the drives so that power handling with Solaris fails. If this is the case
 here:
 
 Easiest way to make it work is to modify /kernel/drv/sd.conf and add an
 entry
 for your specific drive similar to this
 
 sd-config-list= "WD      WD2000FYYG", "power-condition:false",
 "SEAGATE ST2000NM0001", "power-condition:false",
 "SEAGATE ST32000644NS", "power-condition:false",
 "SEAGATE ST91000640SS", "power-condition:false";
 
 Naturally you would have to find out the correct drive names. My latest
 version for a R710 with a MD1200 attached is:
 
 sd-config-list= "SEAGATE ST2000NM0001", "power-condition:false",
 "SEAGATE ST1000NM0001", "power-condition:false",
 "SEAGATE ST91000640SS", "power-condition:false";
 
 
 Are you using the H200 with the base firmware or did you flash it to LSI IT?
 I am not sure that Solaris handles the H200 natively at all and if then it
 will not have direct drive access since the H200 will only show virtual
 drives to Solaris/OI will it not?
 
 Kind regards,
JP
 
 PS: These are not my findings. Kudos to Sergei (tehc...@gmail.com) and
 Niklas Tungström.

One thing came up while trying this - I'm on a text install image
system, so my / is a ramdisk. Any ideas how I can change the sd.conf on
the USB disk or reload the driver configuration on the fly? I tried
looking for the file on the USB drive, but it isn't in the rootfs
(perhaps it's tucked away in some compressed filesystem image). Thanks!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 10:17 AM, Koopmann, Jan-Peter wrote:
 
 
 One thing came up while trying this - I'm on a text install
 image system, so my / is a ramdisk. Any ideas how I can change
 the sd.conf on the USB disk or reload the driver configuration on
 the fly? I tried looking for the file on the USB drive, but it
 isn't in the rootfs (perhaps it's tucked away in some compressed
 filesystem image). Thanks!
 
 I am by no means a Solaris or OI guru and live from good advice of
 other people and Mr. Google. So sorry. I have no clue…

I got lucky at Googling after all and found the relevant command:

# update_drv -vf sd

The PERC H200 card had nothing to do with it; it was all in the crappy
firmware in the HDDs. Simply adding

sd-config-list = "WD      WD3001BKHG", "power-condition:false";

to my /kernel/drv/sd.conf (as you suggested) and reloading the driver
using update_drv solved it and I could then proceed with the
installation. The installer was even smart enough to install the
customized sd.conf into the new system, so no further tuning was
necessary.

Thanks for the pointers, you saved my bacon.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
I'm currently trying to get a SuperMicro JBOD with dual SAS expander
chips running in MPxIO, but I'm a total amateur at this and would like
to ask about how to detect whether MPxIO is working (or not).

My SAS topology is:

 *) One LSI SAS2008-equipped HBA (running the latest IT firmware from
LSI) with two external ports.
 *) Two SAS cables running from the HBA to the SuperMicro JBOD, where
they enter the JBOD's rear backplane (which is equipped with two
LSI SAS expander chips).
 *) From the rear backplane, via two internal SAS cables to the front
backplane (also with two SAS expanders on it)
 *) The JBOD is populated with 45 2TB Toshiba SAS 7200rpm drives

The machine also has a PERC H700 for the boot media, configured into a
hardware RAID-1 (on which rpool resides).
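
The sort of checks I have in mind are along these lines (a sketch: mpathadm
and stmsboot are the stock multipathing tools, but the device name in the
last command is a placeholder, not copied from this box):

# is MPxIO enabled on the mpt_sas ports, and what are the name mappings?
stmsboot -D mpt_sas -L

# list multipathed logical units; each disk should report two paths
mpathadm list lu

# inspect one logical unit in detail (hypothetical device name)
mpathadm show lu /dev/rdsk/c0t5000039XXXXXXXXXd0s2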

Here is the relevant part from cfgadm -al for the MPxIO bits:

c5                             scsi-sas     connected    configured   unknown
c5::dsk/c5t5393D8CB4452d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8C90CF2d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2A6d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2AAd0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2BEd0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2C6d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2E2d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2F2d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF5C6d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF28Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF32Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF35Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF35Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF36Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF36Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF52Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF53Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF53Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF312d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF316d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF506d0      disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF546d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C84F5Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C84FBAd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C851EEd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C852A6d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C852C2d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C852CAd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C852EAd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C854BAd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C854E2d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C855AAd0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8509Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8520Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8528Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8530Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8531Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8557Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8558Ed0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C8560Ad0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C85106d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C85222d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C85246d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C85366d0      disk         connected    configured   unknown
c5::dsk/c5t5393F8C85636d0      disk         connected    configured   unknown
c5::es/ses0                    ESI          connected    configured   unknown
c5::es/ses1                    ESI          connected    configured   unknown
c5::smp/expd0                  smp          connected    configured   unknown
c5::smp/expd1                  smp          connected    configured   unknown
c6                             scsi-sas     connected    configured   unknown
c6::dsk/c6t5393D8CB4453d0      disk         connected    configured   unknown
c6::dsk/c6t5393E8C90CF3d0      disk         connected    configured   unknown
c6::dsk/c6t5393E8CAF2A7d0      disk         connected    configured   unknown
c6::dsk/c6t5393E8CAF2ABd0  disk 

Re: [zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
On 05/25/2012 07:35 PM, Jim Klimov wrote:
 Sorry I can't comment on MPxIO, except that I thought zfs could by
 itself discern two paths to the same drive, if only to protect
 against double-importing the disk into pool.

Unfortunately, it isn't the same thing. MPxIO provides redundant
signaling to the drives, independent of the storage/RAID layer above
it, so it does have its place (besides simply increasing throughput).

 I am not sure it is a good idea to use such low protection (raidz1)
 with large drives. At least, I was led to believe that after 2Tb in
 size raidz2 is preferable, and raidz3 is optimal due to long
 scrub/resilver times leading to large timeframes that a pool with
 an error is exposed to possible fatal errors (due to
 double-failures with single-protection).

I'd use lower protection if it were available :) The data on that
array is not very important; the primary design parameter is low cost
per MB. We're in a very demanding IO environment: we need large
quantities of high-throughput, high-IOPS storage, but we don't need
stellar reliability. If the pool gets corrupted due to an unfortunate
double-drive failure, well, that's tough, but not unbearable (the pool
stores customer channel recordings for nPVR, so nothing critical really).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
On 05/25/2012 08:40 PM, Richard Elling wrote:
 See the solution at https://www.illumos.org/issues/644 -- richard

Good Lord, that was it! It never occurred to me that the drives had a
say in this. Thanks a billion!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-27 Thread Sašo Kiselkov
On 05/07/2012 05:42 AM, Greg Mason wrote:
 I am currently trying to get two of these things running Illumian. I don't 
 have any particular performance requirements, so I'm thinking of using some 
 sort of supported hypervisor, (either RHEL and KVM or VMware ESXi) to get 
 around the driver support issues, and passing the disks through to an 
 Illumian guest.
 
 The H310 does indeed support pass-through (the non-raid mode), but one thing 
 to keep in mind is that I was only able to configure a single boot disk. I 
 configured the rear two drives into a hardware raid 1 and set the virtual 
 disk as the boot disk so that I can still boot the system if an OS disk fails.
 
 Once Illumos is better supported on the R720 and the PERC H310, I plan to get 
 rid of the hypervisor silliness and run Illumos on bare metal.

How about reflashing LSI firmware to the card? I read on Dell's spec
sheets that the card runs an LSISAS2008 chip, so chances are that
standard LSI firmware will work on it. I can send you all the required
bits to do the reflash, if you like.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.
 
 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up in the card's BIOS, but present OK to
 Solaris.
 
 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...

If the drives show up at all, chances are you only need to work around
the power-up issue in Dell HDD firmware.

Here's what I had to do to get the drives going in my R515:
/kernel/drv/sd.conf

sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                 "SEAGATE ST2000NM0001", "power-condition:false";

(that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
your drive model the strings might differ)

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                  "SEAGATE ST2000NM0001", "power-condition:false";

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
 your drive model the strings might differ)
 
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.

Simply take out the drive and have a look at the label.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 12:59 PM, Ian Collins wrote:
 On 05/28/12 10:53 PM, Sašo Kiselkov wrote:
 On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                  "SEAGATE ST2000NM0001", "power-condition:false";

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
 your drive model the strings might differ)
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.
 Simply take out the drive and have a look at the label.
 
 Tricky when the machine is on a different continent!
 
 Joking aside, *I* know what the drive is, the OS as far as I can tell
 doesn't.

Can you have a look at your /var/adm/messages or dmesg to check whether
the OS is complaining about "failed to power up" on the relevant drives?
If yes, then the above fix should work for you; all you need to do is
determine the exact manufacturer and model to enter into sd.conf and
reload the driver via update_drv -vf sd.
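
One quick way to get the exact strings (a sketch; I'm going from memory on
the iostat -En output format, so double-check it on your box):

# print the SCSI inquiry data for every disk; the Vendor and Product
# fields are what goes into sd-config-list (vendor ID padded to 8 chars)
iostat -En | egrep 'Vendor|^c[0-9]'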

Cheers
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 01:12 PM, Ian Collins wrote:
 On 05/28/12 11:01 PM, Sašo Kiselkov wrote:
 On 05/28/2012 12:59 PM, Ian Collins wrote:
 On 05/28/12 10:53 PM, Sašo Kiselkov wrote:
 On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work
 around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                  "SEAGATE ST2000NM0001", "power-condition:false";

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives,
 depending on
 your drive model the strings might differ)
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.
 Simply take out the drive and have a look at the label.
 Tricky when the machine is on a different continent!

 Joking aside, *I* know what the drive is, the OS as far as I can tell
 doesn't.
 Can you have a look at your /var/adm/messages or dmesg to check whether
 the OS is complaining about failed to power up on the relevant drives?
 If yes, then the above fix should work for you, all you need to do is
 determine the exact manufacturer and model to enter into sd.conf and
 reload the driver via update_drv -vf sd.
 
 Yes I do see that warning for the non-raid drives.
 
 The problem is I'm booting from a remote ISO image, so I can't alter
 /kernel/drv/sd.conf.
 
 I'll play more tomorrow, typing on a remote console inside an RDP
 session running in a VNC session on a virtual machine is interesting :)

I'm not sure about the Solaris 11 installer, but OpenIndiana's installer
runs from a ramdisk, so theoretically that should be doable. Other than
that, you could do it by copying the contents of /kernel from the ISO
into a ramdrive, mounting that in place of /kernel and then issuing the
reload command. In any case, you seem to be having exactly the same
issue as I did, so all you need to do is the above magic.
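
Roughly what I have in mind (untested in this exact form; it assumes the
live environment keeps /kernel on read-only media and that there's enough
RAM for the copy):

# copy the kernel config tree somewhere writable
mkdir /tmp/kernel
cp -rp /kernel/* /tmp/kernel

# add the sd-config-list entry for your drives
vi /tmp/kernel/drv/sd.conf

# shadow the original /kernel with the writable copy and reload sd
mount -F lofs /tmp/kernel /kernel
update_drv -vf sd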

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/25/2012 08:40 PM, Richard Elling wrote:
 See the solution at https://www.illumos.org/issues/644
  -- richard

And predictably, I'm back with another n00b question regarding this
array. I've put a pair of LSI-9200-8e controllers in the server and
attached the cables from the enclosure to each of the HBAs. As a result
(why?) I'm getting some really strange behavior:

 * piss poor performance (around 5MB/s per disk tops)
 * fmd(1M) running one core at near 100% saturation each time something
   writes or reads from the pool
 * using fmstat I noticed that it's the eft module receiving hundreds of
   fault reports every second
 * fmd is flooded by multipath failover ereports like:

...
May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered
May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr
May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr
May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered
...



I suspect that multipath is somehow not exactly happy with my
Toshiba disks, but I have no idea what to do to make it work at least
somewhat acceptably. I tried messing with scsi_vhci.conf to set
load-balance=none and change the scsi-vhci-failover-override for the
Toshiba disks to f_asym_lsi, flashing the latest as well as old firmware
in the cards, reseating them in other PCI-e slots, removing one cable
and even removing one whole HBA, unloading the eft fmd module, etc., but
nothing has helped so far and I'm sort of out of ideas. Anybody else got an
idea on what I might try?

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/30/2012 10:53 PM, Richard Elling wrote:
 On May 30, 2012, at 1:07 PM, Sašo Kiselkov wrote:
 
 On 05/25/2012 08:40 PM, Richard Elling wrote:
  See the solution at https://www.illumos.org/issues/644
 -- richard

 And predictably, I'm back with another n00b question regarding this
 array. I've put a pair of LSI-9200-8e controllers in the server and
 attached the cables to the enclosure to each of the HBAs. As a result
 (why?) I'm getting some really strange behavior:

 * piss poor performance (around 5MB/s per disk tops)
 * fmd(1M) running one core at near 100% saturation each time something
   writes or reads from the pool
 * using fmstat I noticed that its the eft module receiving hundreds of
   fault reports every second
 * fmd is flooded by multipath failover ereports like:

 ...
 May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered
 May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr
 May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr
 May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered
 ...



 I suspect that multipath is not exactly happy with my Toshiba disks, but
 I have no idea what to do to make it work at least somewhat acceptably.
 I tried messing with scsi_vhci.conf, setting load-balance=none, changing
 the scsi-vhci-failover-override for the Toshiba disks to f_asym_lsi,
 flashing the latest as well as older firmware in the cards, reseating
 them in other PCI-e slots, removing one cable and even removing a whole
 HBA, unloading the eft fmd module, etc., but nothing has helped so far
 and I'm sort of out of ideas. Anybody else got an idea on what I might
 try?
 
 Those ereports are consistent with faulty cabling. You can trace all of the
 cables and errors using tools like lsiutil, sg_logs, kstats, etc.
 Unfortunately, it is not really possible to get into this level of detail
 over email, and it can consume many hours.
  -- richard

That's actually a pretty good piece of information for me! I will try
changing my cabling to see if I can get the errors to go away. Thanks
again for the suggestions!

Cheers
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/30/2012 10:53 PM, Richard Elling wrote:
 Those ereports are consistent with faulty cabling. You can trace all of the
 cables and errors using tools like lsiutil, sg_logs, kstats, etc.
 Unfortunately, it is not really possible to get into this level of detail
 over email, and it can consume many hours.
  -- richard

And it turns out you were right. Looking at errors using iostat -E while
manipulating the path taken by the data using mpathadm clearly shows
that one of the paths is faulty. Thanks again for pointing me in the
right direction!
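
In case anyone else hits this, the procedure was roughly the following
(the device and port names below are placeholders):

# list the LUs and the paths behind them
mpathadm list lu
mpathadm show lu /dev/rdsk/c0tXXXXXXXXXXXXXXXXd0s2
# force traffic onto one path at a time by disabling the other
mpathadm disable path -i <initiator-port> -t <target-port> \
    -l /dev/rdsk/c0tXXXXXXXXXXXXXXXXd0s2
# and watch which path makes the error counters climb
iostat -En | grep -i error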

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
occasionally seeing a storm of xcalls on one of the 32 VCPUs (on the
order of 100k xcalls a second). The machine is pretty much idle, only
receiving a
bunch of multicast video streams and dumping them to the drives (at a
rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of
xcalls that completely eat one of the CPUs, so the mpstat line for the
CPU looks like:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 31    0   0 102191  1000    0    0    0    0    0    0     0    0 100   0   0

100% busy in the system processing cross-calls. When I tried dtracing
this issue, I found that this is the most likely culprit:

dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
   unix`xc_call+0x46
   unix`hat_tlb_inval+0x283
   unix`x86pte_inval+0xaa
   unix`hat_pte_unmap+0xed
   unix`hat_unload_callback+0x193
   unix`hat_unload+0x41
   unix`segkmem_free_vn+0x6f
   unix`segkmem_zio_free+0x27
   genunix`vmem_xfree+0x104
   genunix`vmem_free+0x29
   genunix`kmem_slab_destroy+0x87
   genunix`kmem_slab_free+0x2bb
   genunix`kmem_magazine_destroy+0x39a
   genunix`kmem_depot_ws_reap+0x66
   genunix`taskq_thread+0x285
   unix`thread_start+0x8
3221701

This happens in the sched (pid 0) process. My fsstat one looks like this:

# fsstat /content 1
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0   664     0    952     0     0     0   664 38.0M /content
    0     0     0   658     0    935     0     0     0   656 38.6M /content
    0     0     0   660     0    946     0     0     0   659 37.8M /content
    0     0     0   677     0    969     0     0     0   676 38.5M /content

What's even more puzzling is that this happens apparently entirely
because of some factor other than userland, since I see no changes to
CPU usage of processes in prstat(1M) when this xcall storm happens, only
an increase of loadavg of +1.00 (the busy CPU).

I Googled and found that
http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
seems to have been an issue identical to mine; however, it remained
unresolved at the time, which worries me about putting this kind of
machine into production use.

Could some ZFS guru please tell me what's going on in segkmem_zio_free?
When I disable the writers to the /content filesystem, this issue goes
away, so it has obviously something to do with disk IO. Thanks!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 04:55 PM, Richard Elling wrote:
 On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote:
 
 So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
 to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
 LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
 occasionally seeing a storm of xcalls on one of the 32 VCPUs (on the
 order of 100k xcalls a second).
 
 That isn't much of a storm, I've seen > 1M xcalls in some cases...

Well it does make one of the cores 100% busy for around 10-15 seconds,
so it is processing at the maximum rate the core can do it. I'd call
that a sign of something bad(tm) going on.

 The machine is pretty much idle, only receiving a
 bunch of multicast video streams and dumping them to the drives (at a
 rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of
 xcalls that completely eat one of the CPUs, so the mpstat line for the
 CPU looks like:

 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  31    0   0 102191  1000    0    0    0    0    0    0     0    0 100   0   0

 100% busy in the system processing cross-calls. When I tried dtracing
 this issue, I found that this is the most likely culprit:

 dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
   unix`xc_call+0x46
   unix`hat_tlb_inval+0x283
   unix`x86pte_inval+0xaa
   unix`hat_pte_unmap+0xed
   unix`hat_unload_callback+0x193
   unix`hat_unload+0x41
   unix`segkmem_free_vn+0x6f
   unix`segkmem_zio_free+0x27
   genunix`vmem_xfree+0x104
   genunix`vmem_free+0x29
   genunix`kmem_slab_destroy+0x87
   genunix`kmem_slab_free+0x2bb
   genunix`kmem_magazine_destroy+0x39a
   genunix`kmem_depot_ws_reap+0x66
   genunix`taskq_thread+0x285
   unix`thread_start+0x8
 3221701

 This happens in the sched (pid 0) process. My fsstat one looks like this:

 # fsstat /content 1
  new  name   name  attr  attr lookup rddir  read read  write write
  file remov  chng   get   set    ops   ops   ops bytes   ops bytes
     0     0     0   664     0    952     0     0     0   664 38.0M /content
     0     0     0   658     0    935     0     0     0   656 38.6M /content
     0     0     0   660     0    946     0     0     0   659 37.8M /content
     0     0     0   677     0    969     0     0     0   676 38.5M /content

 What's even more puzzling is that this happens apparently entirely
 because of some factor other than userland, since I see no changes to
 CPU usage of processes in prstat(1M) when this xcall storm happens, only
 an increase of loadavg of +1.00 (the busy CPU).
 
 What exactly is the workload doing?

As I wrote above, just receiving multicast video streams and writing
them to disk files, nothing else. The fsstat lines above show that -
pure write load.

 Local I/O, iSCSI, NFS, or CIFS?

Purely local I/O via the two LSI SAS controllers, nothing else.

 I Googled and found that
 http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
 seems to have been an issue identical to mine, however, it remains
 unresolved at that time and it worries me about putting this kind of
 machine into production use.

 Could some ZFS guru please tell me what's going on in segkmem_zio_free?
 
 It is freeing memory.

Yes, but why is this causing a ton of cross-calls?

 When I disable the writers to the /content filesystem, this issue goes
 away, so it has obviously something to do with disk IO. Thanks!
 
 Not directly related to disk I/O bandwidth. Can be directly related to other
 use, such as deletions -- something that causes frees.

When I'm not writing to disk it doesn't happen, so my guess is that it
indeed has something to do with (perhaps) ZFS freeing txg buffers or
something...

 Depending on the cause, there can be some tuning that applies for large
 memory machines, where large is >= 96 GB.
  -- richard

I'll try and load the machine with dd(1) to the max to see if access
patterns of my software have something to do with it.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 05:01 PM, Sašo Kiselkov wrote:
 I'll try and load the machine with dd(1) to the max to see if access
 patterns of my software have something to do with it.

Tried and tested, any and all write I/O to the pool causes this xcall
storm issue, writing more data to it only exacerbates it (i.e. it occurs
more often). I still get storms of over 100k xcalls completely draining
one CPU core, but now they happen in 20-30s intervals rather than every
1-2 minutes. Writing to the rpool, however, does not, so I suspect it
has something to do with the MPxIO and how ZFS is pumping data into the
twin LSI 9200 controllers. Each is attached to a different CPU I/O
bridge (since the system has two Opterons, it has two I/O bridges, each
handling roughly half of the PCI-e links). I did this in the hope of
improving performance (since the HT links to the I/O bridges will be
more evenly loaded). Any idea if this might be the cause of this issue?

The whole system diagram is:

CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)-,
 |                                            \
(ht)                                           == JBOD
 |                                            /
CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)-'

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 09:43 PM, Jim Mauro wrote:
 
 I can't help but be curious about something, which perhaps you verified but
 did not post.
 
 What the data here shows is;
 - CPU 31 is buried in the kernel (100% sys).
 - CPU 31 is handling a moderate-to-high rate of xcalls.
 
 What the data does not prove empirically is that the 100% sys time of
 CPU 31 is in xcall handling.
 
 What's the hot stack when this occurs and you run this;
 
 dtrace -n 'profile-997hz /cpu == 31/ { @[stack()] = count(); }'
 

Thanks for pointing this out. I ran the probe you specified and attached
are the results (I had to chase the xcalls around a bit, because they
were jumping around cores as I was trying to insert the probes). As I
suspected, the most numerous stack trace is the one which causes cross
calls because of the segkmem_zio_free+0x27 code path. While this was
going on, I was getting between 80k and 300k xcalls on the core in question.

The next most common stack was the one ending in mach_cpu_idle, so I'm
not sure why the CPU reported 100% busy (perhaps the xcalls were very
expensive in CPU time compared with the 1273 idles).

Cheers,
--
Saso


xc_call.txt.bz2
Description: application/bzip
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
Seems the problem is somewhat more egregious than I thought. The xcall
storm causes my network drivers to stop receiving IP multicast packets
and subsequently my recording applications record bad data, so
ultimately, this kind of isn't workable... I need to somehow resolve
this... I'm running four on-board Broadcom NICs in an LACP
aggregation. Any ideas on why this might be a side-effect? I'm really
kind of out of ideas here...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 03:57 PM, Sašo Kiselkov wrote:
 Seems the problem is somewhat more egregious than I thought. The xcall
 storm causes my network drivers to stop receiving IP multicast packets
 and subsequently my recording applications record bad data, so
 ultimately, this kind of isn't workable... I need to somehow resolve
 this... I'm running four on-board Broadcom NICs in an LACP
 aggregation. Any ideas on why this might be a side-effect? I'm really
 kind of out of ideas here...
 
 Cheers,
 --
 Saso

Just as another datapoint (though I'm not sure if it's going to be much
use): I found via arcstat.pl that the storms always start when ARC
downsizing kicks in. E.g. I would see the following in
./arcstat.pl 1:

    Time  read  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
16:29:45    21     0    0     0    0     0    0   111G  111G
16:29:46     0     0    0     0    0     0    0   111G  111G
16:29:47     1     0    0     0    0     0    0   111G  111G
16:29:48     0     0    0     0    0     0    0   111G  111G
16:29:49    5K     0    0     0    0     0    0   111G  111G
  (this is where the problem starts)
16:29:50    36     0    0     0    0     0    0   109G  107G
16:29:51    51     0    0     0    0     0    0   107G  107G
16:29:52    10     0    0     0    0     0    0   107G  107G
16:29:53   148     0    0     0    0     0    0   107G  107G
16:29:54    5K     0    0     0    0     0    0   107G  107G
  (and after a while, around 10-15 seconds, it stops)

(I omitted the miss and miss% columns to make the rows fit).

During the time, the network stack is dropping input IP multicast UDP
packets like crazy, so I see my network input drop by about 30-40%.
Truly strange behavior...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:21 PM, Matt Breitbach wrote:
 I saw this _exact_ problem after I bumped ram from 48GB to 192GB.  Low
 memory pressure seemed to be the culprit.  Happened usually during storage
 vmotions or something like that which effectively nullified the data in the
 ARC (sometimes 50GB of data would be purged from the ARC).  The system was
 so busy that it would drop 10Gbit LACP portchannels from our Nexus 5k stack.
 I never got a good solution to this other than to set arc_min_c to something
 that was close to what I wanted the system to use - I settled on setting it
 at ~160GB.  It still dropped the arcsz, but it didn't try to adjust arc_c
 and resulted in significantly fewer xcalls.

Hmm, how do I do that? I don't have that kind of symbol in the kernel.
I'm running OpenIndiana build 151a. My system indeed runs at low memory
pressure; I'm simply running a bunch of writers writing files linearly
with data they receive over IP/UDP multicast sockets.
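
The closest knob I can find on my build is the zfs_arc_min tunable (which
seeds arc_c_min), so I suppose something along these lines is meant - a
sketch, assuming one wanted to pin the ARC floor at ~100G:

* in /etc/system, takes effect after a reboot (0x1900000000 = 100 GiB)
set zfs:zfs_arc_min=0x1900000000

# verify after boot:
kstat -p zfs:0:arcstats:c_min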

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:
 
 So the xcall are necessary part of memory reclaiming, when one needs to tear 
 down the TLB entry mapping the physical memory (which can from here on be 
 repurposed).
 So the xcall are just part of this. Should not cause trouble, but they do. 
 They consume a cpu for some time.
 
 That in turn can cause infrequent latency bubble on the network. A certain 
 root cause of these latency bubble is that network thread are bound by 
 default and
 if the xcall storm ends up on the CPU that the network thread is bound to, it 
 will wait for the storm to pass.

I understand, but the xcall storm only eats up a single core out
of a total of 32; plus, it's not a single specific one, it tends to
change, so what are the odds of hitting the same core as the one on
which the mac thread is running?

 So try unbinding the mac threads; it may help you here.

How do I do that? All I can find on interrupt fencing and the like is to
simply set certain processors to no-intr, which moves all of the
interrupts away but doesn't prevent the xcall storm from landing on
these CPUs either...
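
For the record, what I mean by setting processors to no-intr is simply
something like the following - it moves device interrupts, not xcalls:

# keep device interrupts off CPUs 4-7
psradm -i 4 5 6 7
psrinfo             # they now show up as "no-intr"
psradm -n 4 5 6 7   # undo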

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 06:06 PM, Jim Mauro wrote:
 

 So try unbinding the mac threads; it may help you here.

 How do I do that? All I can find on interrupt fencing and the like is to
 simply set certain processors to no-intr, which moves all of the
 interrupts and it doesn't prevent the xcall storm choosing to affect
 these CPUs either…
 
 In /etc/system:
 
 set mac:mac_soft_ring_thread_bind=0
 set mac:mac_srs_thread_bind=0
 
 Reboot required. Verify after reboot with mdb;
 
 echo mac_soft_ring_thread_bind/D | mdb -k

Trying that right now... thanks!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
 find where your nics are bound to
 
 mdb -k
 ::interrupts
 
 create a processor set including those cpus [ so just the nic code will
 run there ]
 
 andy

Tried and didn't help, unfortunately. I'm still seeing drops. What's
even funnier is that I'm seeing drops when the machine is sync'ing the
txg to the zpool. So looking at a little UDP receiver I can see the
following input stream bandwidth (the stream is constant bitrate, so
this shouldn't happen):

4.396151 Mbit/s   - drop
5.217205 Mbit/s
5.144323 Mbit/s
5.150227 Mbit/s
5.144150 Mbit/s
4.663824 Mbit/s   - drop
5.178603 Mbit/s
5.148681 Mbit/s
5.153835 Mbit/s
5.141116 Mbit/s
4.532479 Mbit/s   - drop
5.197381 Mbit/s
5.158436 Mbit/s
5.141881 Mbit/s
5.145433 Mbit/s
4.605852 Mbit/s   - drop
5.183006 Mbit/s
5.150526 Mbit/s
5.149324 Mbit/s
5.142306 Mbit/s
4.749443 Mbit/s   - drop

(txg timeout on my system is the default 5s)
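
For anyone who wants to experiment with the txg interval: this is the
zfs_txg_timeout tunable. A sketch, not a recommendation:

# current value, in seconds
echo zfs_txg_timeout/D | mdb -k
# change it on the fly (write 0t5 back to revert)
echo zfs_txg_timeout/W 0t1 | mdb -kw
# or persistently, in /etc/system:
set zfs:zfs_txg_timeout=1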

It isn't just a slight delay in the arrival of the packets, because then
I should be seeing a rebound on the bitrate, sort of like this:

  [ASCII sketch: each dip in the bitrate immediately followed by a
   compensating spike back above the baseline]

Instead, what I'm seeing is simply:

  [ASCII sketch: each dip in the bitrate followed by a return to the
   baseline, with no compensating spike]

(The missing spikes after the drops mean that there were lost packets
on the NIC.)

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 07:19 PM, Roch Bourbonnais wrote:
 
 Try with this /etc/system tunings :
 
 set mac:mac_soft_ring_thread_bind=0
 set mac:mac_srs_thread_bind=0
 set zfs:zio_taskq_batch_pct=50
 

Thanks for the recommendations, I'll try and see whether it helps, but
this is going to take me a while (especially since the reboot means
I'll start with a cold ARC and need to build it back up to around 120G
of data, which takes a while to accumulate).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks

2012-06-15 Thread Sašo Kiselkov
On 06/15/2012 03:35 PM, Johannes Totz wrote:
 On 15/06/2012 13:22, Sašo Kiselkov wrote:
 On 06/15/2012 02:14 PM, Hans J Albertsson wrote:
 I've got my root pool on a mirror on 2 512 byte blocksize disks. I
 want to move the root pool to two 2 TB disks with 4k blocks. The
 server only has room for two disks. I do have an esata connector,
 though, and a suitable external cabinet for connecting one extra disk.

 How would I go about migrating/expanding the root pool to the
 larger disks so I can then use the larger disks for booting?
 I have no extra machine to use.

 Suppose we call the disks like so:

   A, B: your old 512-block drives
   X, Y: your new 2TB drives

 The easiest way would be to simply:

 1) zpool set autoexpand=on rpool
 2) offline the A drive
 3) physically replace it with the X drive
 4) do a zpool replace on it and wait for it to resilver
 
 When sector size differs, attaching it is going to fail (at least on fbsd).
 You might not get around a send-receive cycle...

Jim Klimov has already posted a way better guide, which rebuilds the
pool using the old one's data, so yeah, the replace route I recommended
here is rendered moot.
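
For the archives, the commands I had in mind for the replace route were
roughly these (device names are made up, and the caveat above about the
sector-size mismatch making the replace fail still applies):

zpool set autoexpand=on rpool
zpool offline rpool c3t0d0     # old 512B drive A
# physically swap in the 2TB drive X, then:
zpool replace rpool c3t0d0
zpool status rpool             # wait for the resilver to finish
# repeat for B -> Y; for a root pool, also reinstall the boot blocks
# on the new disks with installgrub(1M)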

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-17 Thread Sašo Kiselkov
On 06/13/2012 03:43 PM, Roch wrote:
 
 Sašo Kiselkov writes:
   On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:

So the xcall are necessary part of memory reclaiming, when one needs to 
 tear down the TLB entry mapping the physical memory (which can from here on 
 be repurposed).
So the xcall are just part of this. Should not cause trouble, but they 
 do. They consume a cpu for some time.

That in turn can cause infrequent latency bubble on the network. A 
 certain root cause of these latency bubble is that network thread are bound 
 by default and
if the xcall storm ends up on the CPU that the network thread is bound 
 to, it will wait for the storm to pass.
   
   I understand, but the xcall storm settles only eats up a single core out
   of a total of 32, plus it's not a single specific one, it tends to
   change, so what are the odds of hitting the same core as the one on
   which the mac thread is running?
   
 
 That's easy :-) : 1/32 each time it needs to run. So depending on how often 
 it runs (which depends on how
 much churn there is in the ARC) and how often you see the latency bubbles, 
 that may or may
 not be it.
 
 What is zio_taskq_batch_pct on your system ? That is another storm bit of 
 code which
 causes bubble. Setting it down to 50 (versus an older default of 100) should 
 help if it's
 not done already.
 
 -r

So I tried all of the suggestions above (mac unbinding, zio_taskq
tuning) and none helped. I'm beginning to suspect it has something to do
with the networking cards. When I try and snoop filtered traffic from
one interface into a file (snoop -o /tmp/dump -rd vlan935 host
a.b.c.d), my multicast reception throughput plummets to about 1/3 of
the original.

I'm running a link-aggregation of 4 on-board Broadcom NICs:

# dladm show-aggr -x
LINK   PORT  SPEED   DUPLEX  STATE  ADDRESS            PORTSTATE
aggr0  --    1000Mb  full    up     d0:67:e5:fc:bd:38  --
       bnx1  1000Mb  full    up     d0:67:e5:fc:bd:38  attached
       bnx2  1000Mb  full    up     d0:67:e5:fc:bd:3a  attached
       bnx3  1000Mb  full    up     d0:67:e5:fc:bd:3c  attached
       bnx0  1000Mb  full    up     d0:67:e5:fc:bd:36  attached

# dladm show-vlan
LINK     VID  OVER   FLAGS
vlan49   49   aggr0  -
vlan934  934  aggr0  -
vlan935  935  aggr0  -

Normally, I'm getting around 46MB/s on vlan935, however, once I run any
snoop command which puts the network interfaces into promisc mode, my
throughput plummets to around 20MB/s. During that I can see context
switches skyrocket on 4 CPU cores and them being around 75% busy. Now I
understand that snoop has some probe effect, but this is definitely too
large. I've never seen this kind of bad behavior before on any of my
other Solaris systems (with similar load).

Are there any tunings I can make to my network to track down the issue?
My module for bnx is:

# modinfo | grep bnx
169 f80a7000  63ba0 197   1  bnx (Broadcom NXII GbE 6.0.1)

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-18 Thread Sašo Kiselkov
On 06/18/2012 12:05 AM, Richard Elling wrote:
 You might try some of the troubleshooting techniques described in Chapter 5 
 of the DTtrace book by Brendan Gregg and Jim Mauro. It is not clear from your
 description that you are seeing the same symptoms, but the technique should
 apply.
  -- richard

Thanks for the advice, I'll try it. In the meantime, I'm beginning to
suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at

# mdb -k
::interrupts
IRQ  Vect IPL Bus   Trg Type   CPU Share APIC/INT# ISR(s)
[snip]
91   0x82 7   PCI   Edg MSI    5   1     -         pcieb_intr_handler
[snip]

In mpstat I can see that during normal operation, CPU 5 is nearly floored:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  5    0   0    0   512    0 1054    0    0  870    0     0    0  93   0   7

Then, when anything hits which disturbs the PCI-e bus (e.g. a txg flush
or the xcall storm), the CPU goes to 100% utilization and my networking
throughput drops accordingly. The issue can be softened by lowering the
input bandwidth from ~46MB/s to below 20MB/s - at that point I'm getting
only about 10% utilization on the core in question and no xcall storm or
txg flush can influence my network (though I do see the CPU get about
70% busy during the process, but still enough left to avoid packet loss).

So it seems, I'm hitting some hardware design issue, or something...
I'll try moving my network card to the second PCI-e I/O bridge tomorrow
(which seems to be bound to CPU 6).

Any other ideas on what I might try to get the PCI-e I/O bridge
bandwidth back? Or how to fight the starvation of the CPU by other
activities in the system (xcalls and/or txg flushes)? I already tried
putting the CPUs in question into an empty processor set, but that isn't
enough, it seems.
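
To be concrete, the "empty processor set" attempt was just something
along these lines (CPU id taken from the ::interrupts output above):

# dedicate CPU 5 to a set so ordinary threads stay off it
psrset -c 5          # prints the id of the new set
psrset               # list sets to confirm
psrset -d 1          # tear it down again (assuming the new set got id 1)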

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-19 Thread Sašo Kiselkov
On 06/19/2012 11:05 AM, Sašo Kiselkov wrote:
 On 06/18/2012 07:50 PM, Roch wrote:

 Are we hitting :
  7167903 Configuring VLANs results in single threaded soft ring fanout
 
 Confirmed, it is definitely this.

Hold the phone, I just tried unconfiguring all of the VLANs in the
system and went to pure interfaces and it didn't help. So while the
issue stems from the soft ring fanout, it's probably not caused by
VLANs. Thanks for the pointers anyway, though.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] New fast hash algorithm - is it needed?

2012-07-10 Thread Sašo Kiselkov
Hi guys,

I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
implementation to supplant the currently utilized sha256. On modern
64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
slower than many of the SHA-3 candidates, so I went out and did some
testing (details attached) on a possible new hash algorithm that might
improve on this situation.

However, before I start out on a pointless endeavor, I wanted to probe
the field of ZFS users, especially those using dedup, on whether their
workloads would benefit from a faster hash algorithm (and hence, lower
CPU utilization). Developments of late have suggested to me three
possible candidates:

 * SHA-512: simplest to implement (since the code is already in the
   kernel) and provides a modest performance boost of around 60%.

 * Skein-512: overall fastest of the SHA-3 finalists and much faster
   than SHA-512 (around 120-150% faster than the current sha256).

 * Edon-R-512: probably the fastest general purpose hash algorithm I've
   ever seen (upward of 300% speedup over sha256), but might have
   potential security problems (though I don't think this is of any
   relevance to ZFS, as it doesn't use the hash for any kind of security
   purposes, but only for data integrity & dedup).

My testing procedure: nothing sophisticated, I took the implementation
of sha256 from the Illumos kernel and simply ran it on a dedicated
psrset (where possible with a whole CPU dedicated, even if only to a
single thread) - I tested both the generic C implementation and the
Intel assembly implementation. The Skein and Edon-R implementations are
in C optimized for 64-bit architectures from the respective authors (the
most up to date versions I could find). All code has been compiled using
GCC 3.4.3 from the repos (the same that can be used for building
Illumos). Sadly, I don't have access to Sun Studio.
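
For the curious, the psrset part of the setup looked roughly like this
(the CPU ids are just examples and "hashbench" stands in for the little
test driver I hacked together, so treat it as a sketch):

# carve two hardware threads out into a dedicated set and keep
# device interrupts off them
psrset -c 30 31          # prints the id of the new set, e.g. 1
psradm -i 30 31
# run the hashing loop bound to that set
psrset -e 1 ./hashbench -a sha256 -s 10g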

Cheers,
--
Saso
Hash performance on 10 GB of data
gcc (GCC) 3.4.3 (csl-sol210-3_4-20050802)
CFLAGS: -O3 -fomit-frame-pointer -m64

MACHINE #1
CPU: dual AMD Opteron 4234
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     21.19 cycles/byte   (baseline)
sha256 (C)       27.66 cycles/byte   -23.34%

sha512 (ASM)     13.48 cycles/byte   57.20%
sha512 (C)       17.35 cycles/byte   22.13%

Skein-512 (C)     8.95 cycles/byte   136.76%
Edon-R-512 (C)    4.94 cycles/byte   328.94%

MACHINE #2
CPU: single AMD Athlon II Neo N36L
Options: single thread on no-intr 1-core psrset

Algorithm        Result              Improvement
sha256 (ASM)     15.68 cycles/byte   (baseline)
sha256 (C)       18.81 cycles/byte   -16.64%

sha512 (ASM)      9.95 cycles/byte   57.59%
sha512 (C)       11.84 cycles/byte   32.43%

Skein-512 (C)     6.25 cycles/byte   150.88%
Edon-R-512 (C)    3.66 cycles/byte   328.42%

MACHINE #3
CPU: dual Intel Xeon E5645
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     15.49 cycles/byte   (baseline)
sha256 (C)       17.90 cycles/byte   -13.46%

sha512 (ASM)      9.88 cycles/byte   56.78%
sha512 (C)       11.44 cycles/byte   35.40%

Skein-512 (C)     6.88 cycles/byte   125.15%
Edon-R-512 (C)    3.35 cycles/byte   362.39%

MACHINE #4
CPU: single Intel Xeon E5405
Options: single thread on no-intr 1-core psrset

Algorithm        Result              Improvement
sha256 (ASM)     17.45 cycles/byte   (baseline)
sha256 (C)       18.34 cycles/byte   -4.85%

sha512 (ASM)     10.24 cycles/byte   70.41%
sha512 (C)       11.72 cycles/byte   48.90%

Skein-512 (C)     7.32 cycles/byte   138.39%
Edon-R-512 (C)    3.86 cycles/byte   352.07%

MACHINE #5
CPU: dual Intel Xeon E5450
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     16.43 cycles/byte   (baseline)
sha256 (C)       18.50 cycles/byte   -11.19%

sha512 (ASM)     10.37 cycles/byte   58.44%
sha512 (C)       11.85 cycles/byte   38.65%

Skein-512 (C)     7.38 cycles/byte   122.63%
Edon-R-512 (C)    3.88 cycles/byte   323.45%

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 02:18 AM, John Martin wrote:
 On 07/10/12 19:56, Sašo Kiselkov wrote:
 Hi guys,

 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256. On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 Is the intent to store the 512 bit hash or truncate to 256 bit?
 

The intent is to truncate. I know ZFS can only store up to 32 bytes in
the checksum.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:20 AM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov

 I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
 implementation to supplant the currently utilized sha256. On modern
 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
 slower than many of the SHA-3 candidates, so I went out and did some
 testing (details attached) on a possible new hash algorithm that might
 improve on this situation.
 
 As coincidence would have it, I recently benchmarked md5 hashing and AES 
 encryption on systems with and without AES-NI.  Theoretically, hashing should 
 be much faster because it's asymmetric, while symmetric encryption has much 
 less speed potential.  I found md5 could hash at most several hundred MB/sec, 
 and AES was about half to quarter of that speed ...  Which is consistent with 
 the theory.  But if I had AES-NI, then AES was about 1.1 GB/sec.  Which means 
 we have much *much* more speed potential available untapped in terms of 
 hashing.

MD5 is a painfully slow hash compared to the SHA-3 candidates, or even
SHA-512. The candidates I tested produced the following throughputs (a
simple reversal of the cycles/byte metric for each CPU):

Opteron 4234 (3.1 GHz):
  Skein-512: 355 MB/s
  Edon-R: 643 MB/s

AMD Athlon II Neo N36L (1.3 GHz):
  Skein-512: 213 MB/s
  Edon-R: 364 MB/s

Intel Xeon E5645 (2.4 GHz):
  Skein-512: 357 MB/s
  Edon-R: 734 MB/s

Intel Xeon E5405 (2.0 GHz):
  Skein-512: 280 MB/s
  Edon-R: 531 MB/s

Intel Xeon E5450 (3.0 GHz):
  Skein-512: 416 MB/s
  Edon-R: 792 MB/s

Keep in mind that this is single-threaded on a pure-C implementation.
During my tests I used GCC 3.4.3 in order to be able to assess speed
improvements should the code be folded into Illumos (since that's one
compiler Illumos can be built with), but GCC 3.4.3 is seriously
stone-age. Compiling with GCC 4.6.2 I got a further speed boost of
around 10-20%, so even in pure C, Edon-R is probably able to breathe
down the neck of the AES-NI optimized implementation you mentioned.

 Now, when you consider that a single disk typically is able to sustain 
 1.0Gbit (128 MB) per second, it means, very quickly the CPU can become the 
 bottleneck for sustained disk reads in a large raid system.  I think a lot of 
 the time, people with a bunch of disks in a raid configuration are able to 
 neglect the CPU load, just because they're using fletcher.

Yes, that's exactly what I'm getting at. It would be great to have a
hash that you could enable with significantly more peace of mind than
sha256 - with sha256, you always need to keep in mind that the hashing
is going to be super-expensive (even for reads). My testing with a
small JBOD from Supermicro showed that I was easily able to exceed 2
GB/s of reads off of just 45 7k2 SAS drives.

 Of the SHA3 finalists, I was pleased to see that only one of them was based 
 on AES-NI, and the others are actually faster.  So in vague hand-waving 
 terms, yes I believe the stronger  faster hash algorithm, in time will be a 
 big win for zfs performance.  But only in situations where people have a 
 sufficiently large number of disks and sufficiently high expectation for IO 
 performance.

My whole reason for starting this exercise is that RAM is getting dirt
cheap nowadays, so a reasonably large and fast dedup setup can be had
for relatively little money. However, I think that the sha256 hash is
really ill suited to this application and even if it isn't a critical
issue, I think it's really inexcusable we are using something worse than
the best of breed here (especially considering how ZFS was always
designed to be easily extensible with new algorithms).

 CPU's are not getting much faster.  But IO is definitely getting faster.  
 It's best to keep ahead of that curve.

As I said above, RAM is getting cheaper much faster than CPU performance
is. Nowadays you can get 128 GB of server-grade RAM for around $1000, so
equipping a machine with half a terabyte of RAM or more is getting
commonplace. By a first degree approximation (assuming 200 B per block
in the DDT) this would allow one to store upward of 80TB of unique 128K
blocks in 128 GB of RAM - that's easily above 100 TB of attached raw
storage, so we're looking at things like this:

http://dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html

These things come with eight 4-wide SAS 2.0 ports and enough bandwidth
to saturate a QDR InfiniBand port. Another thing to consider are SSDs. A
single 2U server with eight or even sixteen 2.5'' SATA-III SSDs can
achieve even higher throughput. So I fully agree with you, we need to
stay ahead of the curve, however, I think the curve is much closer than
we think!
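
(To spell out the arithmetic behind the 80 TB figure above: 128 GB of RAM
divided by ~200 B per DDT entry gives roughly 6.4e8 entries; 6.4e8 entries
times 128 KB per unique block is about 82 TB of unique data.)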

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 09:58 AM, Ferenc-Levente Juhos wrote:
 Hello all,
 
 what about the fletcher2 and fletcher4 algorithms? According to the zfs man
 page on oracle, fletcher4 is the current default.
 Shouldn't the fletcher algorithms be much faster then any of the SHA
 algorithms?
 On Wed, Jul 11, 2012 at 9:19 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:

Fletcher is a checksum, not a hash. It can and often will produce
collisions, so you need to set your dedup to verify (do a bit-by-bit
comparison prior to deduplication) which can result in significant write
amplification (every write is turned into a read and potentially another
write in case verify finds the blocks are different). With hashes, you
can leave verify off, since hashes are extremely unlikely (~10^-77) to
produce collisions.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:41 AM, Ferenc-Levente Juhos wrote:
 I was under the impression that the hash (or checksum) used for data
 integrity is the same as the one used for deduplication,
 but now I see that they are different.

They are the same in use, i.e. once you switch dedup on, that implies
checksum=sha256. However, if you want to, you can force ZFS to use
fletcher even with dedup by setting dedup=verify and checksum=fletcher4
(setting dedup=on with fletcher4 is not advised due to fletcher's
possibility of producing collisions and subsequent silent corruption).
It's also possible to set dedup=verify with checksum=sha256,
however, that makes little sense (as the chances of getting a random
hash collision are essentially nil).
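
To make that concrete, the combinations look like this (dataset name made
up; treat it as a sketch, since the exact set of accepted property values
can differ between ZFS versions):

# dedup with the secure hash and no verification (implies checksum=sha256)
zfs set dedup=on tank/data
# dedup keyed off fletcher4, with byte-for-byte verification
zfs set checksum=fletcher4 tank/data
zfs set dedup=verify tank/data
# belt and braces: sha256 plus verification
zfs set dedup=sha256,verify tank/data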

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:47 AM, Joerg Schilling wrote:
 Sašo Kiselkov skiselkov...@gmail.com wrote:
 
 write in case verify finds the blocks are different). With hashes, you
 can leave verify off, since hashes are extremely unlikely (~10^-77) to
 produce collisions.
 
 This is how a lottery works. the chance is low but some people still win.
 
 q~A

You do realize that the age of the universe is only on the order of
around 10^18 seconds, don't you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:02 AM, Darren J Moffat wrote:
 On 07/11/12 00:56, Sašo Kiselkov wrote:
   * SHA-512: simplest to implement (since the code is already in the
 kernel) and provides a modest performance boost of around 60%.
 
 FIPS 180-4 introduces SHA-512/t support and explicitly SHA-512/256.
 
 http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf
 
 Note this is NOT a simple truncation of SHA-512 since when using
 SHA-512/t the initial value H(0) is different.

Yes, I know that. In my original post I was only trying to probe the
landscape for whether there is interest in this in the community. Of
course for the implementation I'd go with the standardized truncation
function. Skein-512 already includes a truncation function. For Edon-R,
I'd have to devise one (probably based around SHA-512/256 or Skein).

 See sections 5.3.6.2 and 6.7.
 
 I recommend the checksum value for this be
 checksum=sha512/256
 
 A / in the value doesn't cause any problems and it is the official NIST
 name of that hash.
 
 With the internal enum being: ZIO_CHECKSUM_SHA512_256
 
 CR 7020616 already exists for adding this in Oracle Solaris.

Okay, if I'll implement it identically in Illumos then. However, given
that I plan to use featureflags to advertise the feature to the outside
world, I doubt interoperability will be possible (even though the
on-disk format could be identical).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:50 AM, Ferenc-Levente Juhos wrote:
 Actually, although as you pointed out the chances of an sha256
 collision are minimal, it can still happen, and that would mean
 that the dedup algorithm discards a block that it thinks is a duplicate.
 Probably it's anyway better to do a byte to byte comparison
 if the hashes match to be sure that the blocks are really identical.
 
 The funny thing here is that ZFS tries to solve all sorts of data integrity
 issues with checksumming and healing, etc.,
 and on the other hand a hash collision in the dedup algorithm can cause
 loss of data if wrongly configured.
 
 Anyway thanks that you have brought up the subject, now I know if I will
 enable the dedup feature I must set it to sha256,verify.

Oh jeez, I can't remember how many times this flame war has been going
on on this list. Here's the gist: SHA-256 (or any good hash) produces a
near uniform random distribution of output. Thus, the chances of getting
a random hash collision are around 2^-256 or around 10^-77. If I asked
you to pick two atoms at random *from the entire observable universe*,
your chances of hitting on the same atom are comparable to the chances of
that hash collision. So leave dedup=on with sha256 and move on.
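
For those who prefer the numbers to the atom analogy, a back-of-the-envelope
sketch (assuming a full 256-bit uniform output): by the birthday bound, the
probability of any collision among n unique blocks is roughly n^2 / 2^257.
Even at n = 2^48 unique blocks (at 128K per block that's 32 EiB of unique
data), that works out to 2^96 / 2^257 = 2^-161, i.e. somewhere around
10^-48. Not a risk worth architecting around.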

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 11:53 AM, Tomas Forsman wrote:
 On 11 July, 2012 - Sašo Kiselkov sent me these 1,4K bytes:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77. If I asked
 you to pick two atoms at random *from the entire observable universe*,
 your chances of hitting on the same atom are higher than the chances of
 that hash collision. So leave dedup=on with sha256 and move on.
 
 So in ZFS, which normally uses 128kB blocks, you can instead store them
 100% uniquely into 32 bytes.. A nice 4096x compression rate..
 decompression is a bit slower though..

I really mean no disrespect, but this comment is so dumb I could swear
my IQ dropped by a few tenths of a point just by reading.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:00 PM, casper@oracle.com wrote:
 
 
 You do realize that the age of the universe is only on the order of
 around 10^18 seconds, do you? Even if you had a trillion CPUs each
 chugging along at 3.0 GHz for all this time, the number of processor
 cycles you will have executed cumulatively is only on the order 10^40,
 still 37 orders of magnitude lower than the chance for a random hash
 collision.

 
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagine you store the hash collisions
 in a zfs dataset with dedup enabled using the same hash algorithm.

That is one possibility I considered when evaluating whether to
implement the Edon-R hash algorithm. Its security margin is somewhat
questionable, so then the next step is to determine whether this can be
used in any way to implement a practical data-corruption attack. My
guess is that it can't, but I'm open to debate on this issue, since I'm
not a security expert myself.

The reason why I don't think this can be used to implement a practical
attack is that in order to generate a collision, you first have to know
the disk block that you want to create a collision on (or at least the
checksum), i.e. the original block is already in the pool. At that
point, you could write a colliding block which would get de-dup'd, but
that doesn't mean you've corrupted the original data, only that you
referenced it. So, in a sense, you haven't corrupted the original block,
only your own collision block (since that's the copy that doesn't get
written).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.
 
 Sorry, but isn't this what dedup=verify solves? I don't see the problem here. 
 Maybe all that's needed is a comment in the manpage saying hash algorithms 
 aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets
turned into a read. Assuming a big enough and reasonably busy dataset,
this leads to tremendous write amplification.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:32 PM, Ferenc-Levente Juhos wrote:
 Saso, I'm not flaming at all, I happen to disagree, but still I understand
 that
 chances are very very very slim, but as one poster already said, this is
 how
 the lottery works. I'm not saying one should make an exhaustive search with
 trillions of computers just to produce a sha256 collision.
 If I wanted an exhaustive search I would generate all the numbers from
 0 to 2**256 and I would definitely get at least 1 collision.
 If you formulate it in another way, by generating all the possible 256 bit
 (32 byte)
 blocks + 1 you will definitely get a collision. This is much more credible
 than the analogy with the age of the universe and atoms picked at random,
 etc.

First of all, I never said that the chance is zero. It's definitely
non-zero, but claiming that it is analogous to the lottery is just not
appreciating the scale of the difference.

Next, your proposed method of finding hash collisions is naive in that
you assume that you merely need to generate 256-bit numbers. First of all,
the smallest blocks in ZFS are 1k (IIRC), i.e. 8192 bits. Next, you fail
to appreciate the difficulty of even generating 2^256 256-bit numbers.
Here's how much memory you'd need:

2^256 * 32 ~= 2^261 bytes

Memory may be cheap, but not that cheap...

 The fact is it can happen, it's entirely possible that there are two jpg's
 in the universe with different content and they have the same hash.
 I can't prove the existence of those, but you can't deny it.

I'm not denying that. Read what I said.

 The fact is that zfs and everyone using it tries to correct data
 degradation, e.g. caused by cosmic rays, and on the other hand they're
 using probability calculations (no matter how slim the chances are) to
 potentially discard valid data.
 You can come up with other universe and atom theories and with the
 age of the universe, etc. The fact remains the same.

This is just a profoundly naive statement. You always need to make
trade-offs between practicality and performance. How slim the chances
are has a very real impact on engineering decisions.

 And each generation was convinced that their current best checksum or
 hash algorithm is the best and will be the best forever. MD5 has
 demonstrated that it's not the case. Time will tell what becomes of
 SHA256, but why take any chances.

You really don't understand much about hash algorithms, do you? MD5 has
*very* good safety against random collisions - that's why it's still
used in archive integrity checking (ever wonder what algorithm Debian
and RPM packages use?). The reason why it's been replaced by newer
algorithms has nothing to do with its chance of random hash collisions,
but with deliberate collision attacks on security algorithms built on
top of it. Also, the reason why we don't use it in ZFS is because
SHA-256 is faster and it has a larger pattern space (thus further
lowering the odds of random collisions).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 12:37 PM, Ferenc-Levente Juhos wrote:
 Precisely, I said the same thing a few posts before:
 dedup=verify solves that. And as I said, one could use dedup=hash
 algorithm,verify with
 an inferior hash algorithm (that is much faster) with the purpose of
 reducing the number of dedup candidates.
 For that matter one could use a trivial CRC32, if the two blocks have the
 same CRC you do anyway a
 byte-to-byte comparison, but if they have differenct CRC's you don't need
 to do the actual comparison.

Yes, and that's exactly what the Fletcher4+verify combo does (Fletcher
is faster than CRC32). That's why I started this thread - to assess
whether a better hash algorithm is needed and/or desired.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:09 PM, Justin Stringfellow wrote:
 The point is that hash functions are many to one and I think the point
 was about that verify wasn't really needed if the hash function is good
 enough.
 
 This is a circular argument really, isn't it? Hash algorithms are never 
 perfect, but we're trying to build a perfect one?
  
 It seems to me the obvious fix is to use hash to identify candidates for 
 dedup, and then do the actual verify and dedup asynchronously. Perhaps a 
 worker thread doing this at low priority?
 Did anyone consider this?

This assumes you have low volumes of deduplicated data. As your dedup
ratio grows, so does the performance hit from dedup=verify. At, say,
dedupratio=10.0x, on average, every write results in 10 reads.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:36 PM, casper@oracle.com wrote:
 
 
 This assumes you have low volumes of deduplicated data. As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.
 
 I don't follow.
 
 If dedupratio == 10, it means that each item is *referenced* 10 times
 but it is only stored *once*.  Only when you have hash collisions then 
 multiple reads would be needed.
 
 Only one read is needed except in the case of hash collisions.

No, *every* dedup write will result in a block read. This is how:

 1) ZIO gets block X and computes HASH(X)
 2) ZIO looks up HASH(X) in DDT
  2a) HASH(X) not in DDT - unique write; exit
  2b) HASH(X) in DDT; continue
 3) Read original disk block Y with HASH(Y) = HASH(X)   <-- here's the read
 4) Verify X == Y
  4a) X == Y; increment refcount
  4b) X != Y; hash collision; write new block           <-- here's the collision

So in other words, by the time you figure out you've got a hash
collision, you already did the read, ergo, every dedup write creates a read!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:42 PM, Justin Stringfellow wrote:
 This assumes you have low volumes of deduplicated data. As your dedup
 ratio grows, so does the performance hit from dedup=verify. At, say,
 dedupratio=10.0x, on average, every write results in 10 reads.
 
 Well you can't make an omelette without breaking eggs! Not a very nice one, 
 anyway.
  
 Yes dedup is expensive but much like using O_SYNC, it's a conscious decision 
 here to take a performance hit in order to be sure about our data. Moving the 
 actual reads to a async thread as I suggested should improve things.

And my point here is that the expense is unnecessary and can be omitted
if we choose our algorithms and settings carefully.

Async here won't help, you'll still get equal write amplification,
only it's going to be spread in between txg's.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris derivate with the best long-term future

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:51 PM, Eugen Leitl wrote:
 
 As a napp-it user who recently needs to upgrade from NexentaCore, I saw
 "preferred for OpenIndiana live but running under Illumian, NexentaCore
 and Solaris 11 (Express)" as a system recommendation for napp-it.
 
 I wonder about the future of OpenIndiana and Illumian, which
 fork is likely to see the most continued development, in your opinion?

I use OpenIndiana personally, since it's the one I'm most familiar with
(direct continuation of OpenSolaris tradition). If you need something
with commercial support in that spirit, I recommend having a look at
OmniOS. Joyent's SmartOS is really interesting, albeit a bit
narrow-profile for my taste (plus, its use of NetBSD packaging means
I'll have to adapt to a new way of doing things and I like IPS very much).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:39 PM, David Magda wrote:
 On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
 [...]
 
 I'd wait until SHA-3 is announced. It's supposed to happen this year, of
 which only six months are left:
 
 http://csrc.nist.gov/groups/ST/hash/timeline.html
 http://en.wikipedia.org/wiki/NIST_hash_function_competition
 
 It was actually supposed to happen in 2Q, so they're running a little
 late it seems.

I'm not convinced waiting makes much sense. The SHA-3 standardization
process' goals are different from ours. SHA-3 can choose to go with
something that's slower, but has a higher security margin. I think that
absolute super-tight security isn't all that necessary for ZFS, since
the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?

Don't think that, just because you can think of this procedure, the
crypto security guys at universities haven't thought about it as well. Of
course they have. No, simply generating a sequence of random patterns and
hoping to hit a match won't do the trick.

P.S. I really don't mean to sound smug or anything, but I know one thing
for sure: the crypto researchers who propose these algorithms are some
of the brightest minds on this topic on the planet, so I hardly think
they would have overlooked something this trivial.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:58 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov

 I really mean no disrespect, but this comment is so dumb I could swear
 my IQ dropped by a few tenths of a point just by reading.
 
 Cool it please.  You say I mean no disrespect and then say something which
 is clearly disrespectful.

I sort of flew off the handle there, and I shouldn't have. It felt like
Tomas was misrepresenting my position and putting words in my mouth I
didn't say. I certainly didn't mean to diminish the validity of an
honest question.

 Tomas's point is to illustrate that hashing is a many-to-one function.  If
 it were possible to rely on the hash to always be unique, then you could use
 it as a compression algorithm.  He's pointing out that's insane.  His
 comment was not in the slightest bit dumb; if anything, it seems like maybe
 somebody (or some people) didn't get his point.

I understood his point very well and I never argued that hashing always
results in unique hash values, which is why I thought he was
misrepresenting what I said.

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) their output is far smaller than the entropy (Shannon limit) of
arbitrary input, so lossless compression is impossible
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)

A formal proof would of course take longer to construct and would take
time that I feel is best spent writing code.
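
If you want to see the many-to-one behaviour with your own eyes, here's a
toy Python sketch: truncate SHA-256 to 24 bits and the birthday effect
produces collisions after only a few thousand random inputs. The truncation
is purely for demonstration; the full 256-bit output makes the same search
astronomically infeasible, and note that the two colliding inputs are
unrelated - the digest alone can't tell you which one was "meant":

import hashlib
import os

seen = {}
while True:
    block = os.urandom(32)                     # random candidate input
    tag = hashlib.sha256(block).digest()[:3]   # keep only 24 bits
    if tag in seen and seen[tag] != block:
        print("collision on 24-bit prefix:", tag.hex())
        print("  input A:", seen[tag].hex())
        print("  input B:", block.hex())
        break
    seen[tag] = block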

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:19 PM, Gregg Wonderly wrote:
 But this is precisely the kind of observation that some people seem to miss 
 out on the importance of.  As Tomas suggested in his post, if this was true, 
 then we could have a huge compression ratio as well.  And even if there was 
 10% of the bit patterns that created non-unique hashes, you could use the 
 fact that a block hashed to a known bit pattern that didn't have collisions, 
 to compress the other 90% of your data.
 
 I'm serious about this from a number of perspectives.  We worry about the 
 time it would take to reverse SHA or RSA hashes to passwords, not even 
 thinking that what if someone has been quietly computing all possible hashes 
 for the past 10-20 years into a database some where, with every 5-16 
 character password, and now has an instantly searchable hash-to-password 
 database.

This is something very well known in the security community as "rainbow
tables", and a common method to protect against it is salting. Never use
a password hashing scheme which doesn't use salts, for exactly the
reason you outlined above.
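
A minimal sketch of what salted password hashing looks like in Python (the
parameters - SHA-256, a 16-byte salt, 100000 iterations - are just
illustrative choices, not a recommendation):

import hashlib
import hmac
import os

def hash_password(password):
    salt = os.urandom(16)                    # per-user random salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return salt, digest

def check_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, digest))  # True
print(check_password("wrong guess", salt, digest))                   # False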

 Sometimes we ignore the scale of time, thinking that only the immediately 
 visible details are what we have to work with.
 
 If no one has computed the hashes for every single 4K and 8K block, then 
 fine.  But, if that was done, and we had that data, we'd know for sure which 
 algorithm was going to work the best for the number of bits we are 
 considering.

Do you even realize how many 4K or 8K blocks there are?!?! Exactly
2^32768 or 2^65536 respectively. I wouldn't worry about somebody having
those pre-hashed ;-) Rainbow tables only work for a very limited subset
of data.
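
If you want to check the numbers yourself, a couple of lines of Python will
do (the ~10^80 atoms figure is the commonly quoted estimate for the
observable universe):

n_4k = 2 ** (4096 * 8)
n_8k = 2 ** (8192 * 8)
print("distinct 4K blocks:  ~10^%d" % (len(str(n_4k)) - 1))   # ~10^9864
print("distinct 8K blocks:  ~10^%d" % (len(str(n_8k)) - 1))   # ~10^19728
print("atoms in the observable universe: ~10^80")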

 Speculating based on the theory of the algorithms for random number of bits 
 is just silly.  Where's the real data that tells us what we need to know?

If you don't trust math, then there's little I can do to convince you.
But remember our conversation the next time you step into a car or get
on an airplane. The odds that you'll die on that ride are far higher
than the odds that you'll ever find a random hash collision in a 256-bit
hash algorithm...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:22 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 If an algorithm is not 'secure' and zfs is not set to verify, doesn't
 that mean that a knowledgeable user will be able to cause intentional
 data corruption if deduplication is enabled?  A user with very little
 privilege might be able to cause intentional harm by writing the magic
 data block before some other known block (which produces the same hash)
 is written.  This allows one block to substitute for another.
 
 It does seem that security is important because with a human element,
 data is not necessarily random.

Theoretically yes, it is possible, but the practicality of such an
attack is very much in doubt. In case this is a concern, however, one
can always switch to a more secure hash function (e.g. Skein-512).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:23 PM, casper@oracle.com wrote:
 
 On Tue, 10 Jul 2012, Edward Ned Harvey wrote:

 CPU's are not getting much faster.  But IO is definitely getting faster.
 It's best to keep ahead of that curve.

 It seems that per-socket CPU performance is doubling every year. 
 That seems like faster to me.
 
 I think that I/O isn't getting as fast as CPU is; memory capacity and
 bandwidth and CPUs are getting faster.  I/O, not so much.
 (Apart from the one single step from harddisk to SSD; but note that
 I/O is limited to standard interfaces and as such it is likely to be
 held down by requiring a new standard.)

Have you seen one of those SSDs made by FusionIO? Those things fit in a
single PCI-e x8 slot and can easily push a sustained rate upward of
several GB/s. Do not expect that drives are the be-all and end-all of
storage. Hybrid storage invalidated the traditional "CPU & memory fast,
disks slow" wisdom years ago.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
 Unfortunately, the government imagines that people are using their home 
 computers to compute hashes and try and decrypt stuff.  Look at what is 
 happening with GPUs these days.  People are hooking up 4 GPUs in their 
 computers and getting huge performance gains.  5-6 char password space 
 covered in a few days.  12 or so chars would take one machine a couple of 
 years if I recall.  So, if we had 20 people with that class of machine, we'd 
 be down to a few months.   I'm just suggesting that while the compute space 
 is still huge, it's not actually undoable, it just requires some thought into 
 how to approach the problem, and then some time to do the computations.
 
 Huge space, but still finite…

There are certain physical limits which one cannot exceed. For instance,
you cannot store 2^256 units of 32-byte quantities on Earth. Even if you
used proton spin (or some other quantum property) to store a bit, there
simply aren't enough protons in the entire visible universe to do it.
You will never ever be able to search a 256-bit memory space using a
simple exhaustive search. The reason why our security hashes are so long
(256-bits, 512-bits, more...) is because attackers *don't* do an
exhaustive search.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
 This is exactly the issue for me.  It's vital to always have verify on.  If 
 you don't have the data to prove that every possible block combination 
 possible, hashes uniquely for the small bit space we are talking about, 
 then how in the world can you say that verify is not necessary?  That just 
 seems ridiculous to propose.

Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated the ability to produce a hash collision in any
suitably long hash (128 bits and up) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:36 PM, Justin Stringfellow wrote:
 
 
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?
  
 Perhaps I've missed something, but if there was *never* a collision, you'd 
 have stumbled across a rather impressive lossless compression algorithm. I'm 
 pretty sure there's some Big Mathematical Rules (Shannon?) that mean this 
 cannot be.

Do you realize how big your lookup dictionary would have to be?

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
 As I said several times before, to produce hash collisions. Or to calculate
 rainbow tables (as a previous user theorized it) you only need the
 following.
 
 You don't need to reproduce all possible blocks.
 1. SHA256 produces a 256 bit hash
 2. That means it produces a value on 256 bits, in other words a value
 between 0..2^256 - 1
 3. If you start counting from 0 to 2^256 and for each number calculate the
 SHA256 you will get at least one hash collision (if the hash algorithm is
 perfectly distributed)
 4. Counting from 0 to 2^256 is nothing else but reproducing all possible
 bit patterns on 32 bytes
 
 It's not about whether one computer is capable of producing the above
 hashes or not, or whether there are actually that many unique 32 byte bit
 patterns in the universe.
 A collision can happen.

It's actually not that simple, because in hash collision attacks you're
not always afforded the luxury of being able to define your input block.
More often than not, you want to modify a previously hashed block in
such a fashion that it carries your intended modifications while hashing
to the same original value. Say for instance you want to modify a
512-byte message (e.g. an SSL certificate) to point to your own CN. Here
your rainbow table, even if you could store it somewhere (you couldn't,
btw), would do you little good.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:54 PM, Ferenc-Levente Juhos wrote:
 You don't have to store all hash values:
 a. Just memorize the first one SHA256(0)
 b. start counting
 c. bang: by the time you get to 2^256 you get at least a collision.

Just one question: how long do you expect this is going to take on
average? Come on, do the math!
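
To save you the trouble, here it is in Python with absurdly generous
assumptions - a billion machines, each doing a trillion SHA-256
computations per second. These figures are invented purely to show the
scale:

hashes_per_sec = 10**9 * 10**12          # 10^21 hashes/s world-wide
seconds = 2**256 / hashes_per_sec
years = seconds / (3600 * 24 * 365)
print("~%.1e years (the universe is ~1.4e10 years old)" % years)
# prints roughly 4e48 years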

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
 So, if I had a block collision on my ZFS pool that used dedup, and it had my 
 bank balance of $3,212.20 on it, and you tried to write your bank balance of 
 $3,292,218.84 and got the same hash, no verify, and thus you got my 
 block/balance and now your bank balance was reduced by 3 orders of magnitude, 
 would you be okay with that?  What assurances would you be content with using 
 my ZFS pool?

I'd feel entirely safe. There, I said it.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:10 PM, David Magda wrote:
 On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:
 
 I'm not convinced waiting makes much sense. The SHA-3 standardization
 process' goals are different from ours. SHA-3 can choose to go with
 something that's slower, but has a higher security margin. I think that
 absolute super-tight security isn't all that necessary for ZFS, since
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 Fair enough, though I think eventually the SHA-3 winner will be
 incorporated into hardware (or at least certain instructions used in the
 algorithm will). I think waiting a few more weeks/months shouldn't be a
 big deal, as the winner should be announced Real Soon Now, and then a more
 informed decision can probably be made.

The AES process winner was announced in October 2000. Considering
AES-NI was proposed in March 2008 and first silicon for it appeared
around January 2010, I wouldn't hold my breath hoping for hardware
SHA-3-specific acceleration getting a widespread foothold for at least
another 5-10 years (around 2-3 technology generations).

That being said, a lot can be achieved using SIMD instructions, but that
doesn't depend on the SHA-3 process in any way.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:33 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

 The reason why I don't think this can be used to implement a practical
 attack is that in order to generate a collision, you first have to know
 the disk block that you want to create a collision on (or at least the
 checksum), i.e. the original block is already in the pool. At that
 point, you could write a colliding block which would get de-dup'd, but
 that doesn't mean you've corrupted the original data, only that you
 referenced it. So, in a sense, you haven't corrupted the original block,
 only your own collision block (since that's the copy that doesn't get
 written).
 
 This is not correct.  If you know the well-known block to be written,
 then you can arrange to write your collision block prior to when the
 well-known block is written.  Therefore, it is imperative that the hash
 algorithm make it clearly impractical to take a well-known block and
 compute a collision block.
 
 For example, the well-known block might be part of a Windows anti-virus
 package, or a Windows firewall configuration, and corrupting it might
 leave a Windows VM open to malware attack.

True, but that may not be enough to produce a practical collision for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs, which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
 You're entirely sure that there could never be two different blocks that can 
 hash to the same value and have different content?
 
 Wow, can you just send me the cash now and we'll call it even?

You're the one making the positive claim and I'm calling bullshit. So
the onus is on you to demonstrate the collision (and that you arrived at
it via your brute force method as described). Until then, my money stays
safely in my bank account. Put up or shut up, as the old saying goes.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 06:23 PM, Gregg Wonderly wrote:
 What I'm saying is that I am getting conflicting information from your 
 rebuttals here.

Well, let's address that then:

 I (and others) say there will be collisions that will cause data loss if 
 verify is off.

Saying that "there will be" without any supporting evidence to back it
up amounts to prophecy.

 You say it would be so rare as to be impossible from your perspective.

Correct.

 Tomas says, "well then let's just use the hash value for a 4096X compression."
 You fluff around his argument calling him names.

Tomas' argument was, as I understood later, an attempt at sarcasm.
Nevertheless, I later explained exactly why I consider the
hash-compression claim total and utter bunk:

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) their output is far smaller than the entropy (Shannon limit) of
arbitrary input, so lossless compression is impossible
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)

 I say, well then compute all the possible hashes for all possible bit 
 patterns and demonstrate no dupes.

This assumes it's possible to do so. Ferenc made a similar claim and I
responded with this question: "how long do you expect this is going to
take on average? Come on, do the math!" I pose the same to you. Find the
answer and you'll understand exactly why what you're proposing is
impossible.

 You say it's not possible to do that.

Please go on and compute a reduced size of the problem for, say, 2^64
32-byte values (still a laughably small space for the problem, but I'm
feeling generous). Here's the amount of storage you'll need:

2^64 * 32 bytes = 2^69 bytes = 512 exabytes

And that's for a problem that I've shrunk for you by a factor of 2^192.
You see, only when you do the math do you realize how off base you are
in claiming that pre-computation of hash rainbow tables for generic bit
patterns is doable.

 I illustrate a way that loss of data could cost you money.

That's merely an emotional argument: you are trying to invoke an
emotional response by putting my own money on the line. Sorry, that
doesn't invalidate the original argument that you can't do rainbow table
pre-computation for long bit patterns.

 You say it's impossible for there to be a chance of me constructing a block 
 that has the same hash but different content.

To make sure we're not using ambiguous rhetoric here, allow me to
summarize my position: you cannot produce, in practical terms, a hash
collision on a 256-bit secure hash algorithm using a brute-force search.

 Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
 of compression, yet you still say it's viable to leave verify off.

Except that we're not talking 128K to 32b, but 128K to 256b. Also, only
once you appreciate the mathematics behind the size of the 256-bit
pattern space can you understand why leaving verify off is okay.

 I say, in fact that the total number of unique patterns that can exist on any 
 pool is small, compared to the total, illustrating that I understand how the 
 key space for the algorithm is small when looking at a ZFS pool, and thus 
 could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took
courses on cryptography and/or information theory. The size of your
storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
if your pool were the size of a single block, we're talking here about
the *mathematical* possibility of hitting on a random block that hashes
to the same value. Given a stream of random data blocks (thus simulating
an exhaustive brute-force search) and a secure pseudo-random hash
function (which has a roughly equal chance of producing any output value
for a given input block), you've got only a 10^-77 chance of getting a
hash collision. If you don't understand how this works, read a book on
digital coding theory.

 So I can see what perspective you are drawing your confidence from, but I, 
 and others, are not confident that the risk has zero probability.

I never said the risk is zero. The risk is non-zero, but it is so close
to zero that you may safely ignore it (since we take much greater risks on
a daily basis without so much as a blink of an eye).

 I'm pushing you to find a way to demonstrate that there is zero risk because 
 if you do that, then you've, in fact created the ultimate compression factor 
 (but enlarged the keys that could collide because the pool is now virtually 
 larger), to date for random bit patterns, and you've also demonstrated that 
 the particular algorithm is very good for dedup. 
 That would indicate to me, that you can then take that algorithm, and run it 
 inside of ZFS dedup to automatically manage when verify is necessary by 
 detecting when a collision occurs.

Do 

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:06 PM, Bill Sommerfeld wrote:
 On 07/11/12 02:10, Sašo Kiselkov wrote:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77.
 
 I think you're correct that most users don't need to worry about this --
 sha-256 dedup without verification is not going to cause trouble for them.
 
 But your analysis is off.  You're citing the chance that two blocks picked at
 random will have the same hash.  But that's not what dedup does; it compares
 the hash of a new block to a possibly-large population of other hashes, and
 that gets you into the realm of birthday problem or birthday paradox.
 
 See http://en.wikipedia.org/wiki/Birthday_problem for formulas.
 
 So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
 collision in really large collections of data - still not likely enough to
 worry about.

Yeah, I know, I did this as a quick first-order approximation. However,
even the corrected range still leaves a collision far less likely than a
random bit-rot error that even Fletcher won't catch.
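
For the record, here's the back-of-the-envelope birthday arithmetic in
Python. The block counts are just example figures (a pebibyte and an
exbibyte of unique 4 KiB blocks), not measurements:

def p_collision(n_blocks, hash_bits=256):
    # approximate birthday bound: n*(n-1)/2 pairs, each ~2^-256 likely
    return n_blocks * (n_blocks - 1) / 2 / 2**hash_bits

for label, n in (("1 PiB of unique 4K blocks", 2**38),
                 ("1 EiB of unique 4K blocks", 2**48)):
    print("%-28s p ~ %.1e" % (label, p_collision(n)))
# prints roughly 3e-55 and 3e-49, respectively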

 Of course, that assumption goes out the window if you're concerned that an
 adversary may develop practical ways to find collisions in sha-256 within the
 deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
 and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
 its 160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
 http://en.wikipedia.org/wiki/SHA-1#SHA-1).

Of course, this is theoretically possible; however, I do not expect such
an attack to become practical within any reasonable deployment time
frame. In any case, should a realistic need to solve this arise, we
can always simply switch hashes (I'm also planning to implement
Skein-512/256) and do a send/recv to rewrite everything on disk. PITA?
Yes. Serious problem? Don't think so.

 on a somewhat less serious note, perhaps zfs dedup should contain chinese
 lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
 which asks the sysadmin to report a detected sha-256 collision to
 eprint.iacr.org or the like...

How about we ask them to report to me instead, like so:

1) Detect collision
2) Report to me
3) ???
4) Profit!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 07:16 PM, Tim Cook wrote:
 Sasso: yes, it's absolutely worth implementing a higher performing hashing
 algorithm.  I'd suggest simply ignoring the people that aren't willing to
 acknowledge basic mathematics rather than lashing out.  No point in feeding
 the trolls.  The PETABYTES of data Quantum and DataDomain have out there
 are proof positive that complex hashes get the job done without verify,
 even if you don't want to acknowledge the math behind it.

That's what I deduced as well. I have far too much time to explain
statistics, coding theory and compression algorithms to random people on
the Internet. Code talks. I'm almost done implementing SHA-512/256 and
Edon-R-512/256. I've spent most of the time learning how the ZFS code is
structured and I now have most of the pieces in place, so expect the
completed code drop (with SHA-512, Edon-R and Skein) by this weekend.

Cheers,

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-12 Thread Sašo Kiselkov
On 07/12/2012 09:52 PM, Sašo Kiselkov wrote:
 I have far too much time to explain

P.S. that should have read "I have taken far too much time explaining".
Men are crap at multitasking...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow speed problem with a new SAS shelf

2012-07-23 Thread Sašo Kiselkov
Hi,

Have you had a look at iostat -E (error counters) to make sure you don't
have faulty cabling? I've had bad cables trip me up once in a manner
similar to your situation here.

Cheers,
--
Saso

On 07/23/2012 07:18 AM, Yuri Vorobyev wrote:
 Hello.
 
 I ran into a strange performance problem with a new disk shelf.
 We have been using a ZFS system with SATA disks for a while.
 It is Supermicro SC846-E16 chassis, Supermicro X8DTH-6F motherboard with
 96Gb RAM and 24 HITACHI HDS723020BLA642 SATA disks attached to onboard
 LSI 2008 controller.
 
 Pretty much satisfied with it, we bought an additional shelf with SAS disks
 for VM hosting. The new shelf is a Supermicro SC846-E26 chassis. The disk
 model is HITACHI HUS156060VLS600 (15K 600Gb SAS2).
 An additional LSI 9205-8e controller was installed in the server and
 connected to the JBOD.
 I connected the JBOD with 2 channels and set up multipath first, but when I
 noticed the performance problem I disabled multipath and disconnected one
 cable (to make sure multipath was not the cause of the problem).
 
 Problem description follow:
 
 Creating a test pool with 5 pairs of mirrors (new shelf, SAS disks)
 
 # zpool create -o version=28 -O primarycache=none test mirror
 c9t5000CCA02A138899d0 c9t5000CCA02A102181d0 mirror c9t5000CCA02A13500Dd0
 c9t5000CCA02A13316Dd0 mirror c9t5000CCA02A005699d0 c9t5000CCA02A004271d0
 mirror c9t5000CCA02A004229d0 c9t5000CCA02A1342CDd0 mirror
 c9t5000CCA02A1251E5d0 c9t5000CCA02A1151DDd0
 
 (primarycache=none) to disable ARC influence
 
 
 Testing sequential write
 # dd if=/dev/zero of=/test/zero bs=1M count=2048
 2048+0 records in
 2048+0 records out
 2147483648 bytes (2.1 GB) copied, 1.04272 s, 2.1 GB/s
 
 iostat when writing look like
     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
     0.0 1334.6    0.0 165782.9  0.0  8.4    0.0    6.3   1  86
 c9t5000CCA02A1151DDd0
     0.0 1345.5    0.0 169575.3  0.0  8.7    0.0    6.5   1  88
 c9t5000CCA02A1342CDd0
     2.0 1359.5    1.0 168969.8  0.0  8.7    0.0    6.4   1  90
 c9t5000CCA02A13500Dd0
     0.0 1358.5    0.0 168714.0  0.0  8.7    0.0    6.4   1  90
 c9t5000CCA02A13316Dd0
     0.0 1345.5    0.0     19.3  0.0  9.0    0.0    6.7   1  92
 c9t5000CCA02A102181d0
     1.0 1317.5    1.0 164456.9  0.0  8.5    0.0    6.5   1  88
 c9t5000CCA02A004271d0
     4.0 1342.5    2.0 166282.2  0.0  8.5    0.0    6.3   1  88
 c9t5000CCA02A1251E5d0
     0.0 1377.5    0.0 170515.5  0.0  8.7    0.0    6.3   1  90
 c9t5000CCA02A138899d0
 
 Now read
 # dd if=/test/zero of=/dev/null  bs=1M
 2048+0 records in
 2048+0 records out
 2147483648 bytes (2.1 GB) copied, 13.5681 s, 158 MB/s
 
 iostat when reading
     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   106.0    0.0 11417.4    0.0  0.0  0.2    0.0    2.4   0  14
 c9t5000CCA02A004271d0
    80.0    0.0 10239.9    0.0  0.0  0.2    0.0    2.4   0  10
 c9t5000CCA02A1251E5d0
   110.0    0.0 12182.4    0.0  0.0  0.1    0.0    1.3   0   9
 c9t5000CCA02A138899d0
   102.0    0.0 11664.4    0.0  0.0  0.2    0.0    1.8   0  15
 c9t5000CCA02A005699d0
    99.0    0.0 10900.9    0.0  0.0  0.3    0.0    3.0   0  16
 c9t5000CCA02A004229d0
   107.0    0.0 11545.4    0.0  0.0  0.2    0.0    1.9   0  13
 c9t5000CCA02A1151DDd0
    81.0    0.0 10367.9    0.0  0.0  0.2    0.0    2.2   0  11
 c9t5000CCA02A1342CDd0
 
 Unexpectedly low speed! Note the busy column: when writing it is about 90%,
 when reading only about 15%.
 
 Individual disks' raw read speed (don't be confused by the name change; I
 connected the JBOD to another HBA channel)
 
 # dd if=/dev/dsk/c8t5000CCA02A13889Ad0 of=/dev/null bs=1M count=2000
 2000+0 records in
 2000+0 records out
 2097152000 bytes (2.1 GB) copied, 10.9685 s, 191 MB/s
 # dd if=/dev/dsk/c8t5000CCA02A1342CEd0 of=/dev/null bs=1M count=2000
 2000+0 records in
 2000+0 records out
 2097152000 bytes (2.1 GB) copied, 10.8024 s, 194 MB/s
 
 The 10-disk mirror zpool reads slower than a single disk.
 
 There is no tuning in /etc/system
 
 I tried the test with a FreeBSD 8.3 live CD. Reads were the same (about
 150 MB/s). I also tried SmartOS, but it can't see disks behind the LSI
 9205-8e controller.
 
 For comparison, this is the speed from the SATA pool (it consists of 4
 6-disk raidz2 vdevs)
 #dd if=CentOS-6.2-x86_64-bin-DVD1.iso of=/dev/null bs=1M
 4218+1 records in
 4218+1 records out
 4423129088 bytes (4.4 GB) copied, 4.76552 s, 928 MB/s
 
      r/s    w/s     kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13614.4    0.0 800338.5    0.0  0.1 36.0    0.0    2.6   0 914 c6
    459.9    0.0  25761.4    0.0  0.0  0.8    0.0    1.8   0  22
 c6t5000CCA369D16860d0
     84.0    0.0   2785.2    0.0  0.0  0.2    0.0    3.0   0  13
 c6t5000CCA369D1B1E0d0
    836.9    0.0  50089.5    0.0  0.0  2.6    0.0    3.1   0  60
 c6t5000CCA369D1B302d0
    411.0    0.0  24492.6    0.0  0.0  0.8    0.0    2.1   0  25
 c6t5000CCA369D16982d0
    821.9    0.0  49385.1    0.0  0.0  3.0    0.0    3.7   0  67
 c6t5000CCA369CFBDA3d0
    231.0    0.0  12292.5    0.0  0.0  0.5    0.0    2.3   0  18
 c6t5000CCA369D17E73d0
 

Re: [zfs-discuss] online increase of zfs after LUN increase ?

2012-07-25 Thread Sašo Kiselkov
On 07/25/2012 05:49 PM, Habony, Zsolt wrote:
 Hello,
   There is a feature of zfs (autoexpand, or zpool online -e ) that it can 
 consume the increased LUN immediately and increase the zpool size.
 That would be a very useful (vital) feature in an enterprise environment.
 
 Though when I tried to use it, it did not work.  LUN expanded and visible in 
 format, but zpool did not increase.
 I found a bug SUNBUG:6430818 (Solaris Does Not Automatically Handle an 
 Increase in LUN Size) 
 Bad luck. 
 
 Patch exists: 148098 but _not_ part of recommended patch set.  Thus my fresh 
 install Sol 10 U9 with latest patch set still has the problem.  ( Strange 
 that this problem 
  is not considered high impact ... )
 
 It mentions a workaround: zpool export, re-label the LUN using the
 format(1m) command, zpool import.
 
 Can you pls. help in that, what does that re-label mean ?  
 (As I need to ask downtime for the zone now ... , would like to prepare for 
 what I need to do )
 
 I have used the format utility thousands of times for organizing partitions,
 though I have no idea how I would relabel a disk.
 Also, I did not use format to label the disks; I gave the LUN to zpool
 directly. I would not dare to touch or resize any partition with the format
 utility, not knowing what zpool wants to see there.
 
 Have you experienced such problem, and do you know how to increase zpool 
 after a LUN increase ?

Relabeling simply means running the label command in the format
utility after you've made changes to the slices. As long as you keep the
starting cylinder of a slice the same and don't shrink it, nothing bad
should happen.

Are you doing this on a root pool?

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-29 Thread Sašo Kiselkov
On 07/29/2012 04:07 PM, Jim Klimov wrote:
 Hello, list

Hi Jim,

   For several times now I've seen statements on this list implying
 that a dedicated ZIL/SLOG device catching sync writes for the log,
 also allows for more streamlined writes to the pool during normal
 healthy TXG syncs, than is the case with the default ZIL located
 within the pool.
 
   Is this understanding correct? Does it apply to any generic writes,
 or only to sync-heavy scenarios like databases or NFS servers?

Yes, it is correct. It applies to all writes. If the log is allocated on
a slog device, then the synchronous log records don't fragment the
pool. As far as I understand it, txgs happen sequentially even with no
slog device present, but the log entries don't - they are written as
needed to fulfill each sync write request with minimum latency.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL devices and fragmentation

2012-07-29 Thread Sašo Kiselkov
On 07/29/2012 06:01 PM, Jim Klimov wrote:
 2012-07-29 19:50, Sašo Kiselkov wrote:
 On 07/29/2012 04:07 PM, Jim Klimov wrote:
For several times now I've seen statements on this list implying
 that a dedicated ZIL/SLOG device catching sync writes for the log,
 also allows for more streamlined writes to the pool during normal
 healthy TXG syncs, than is the case with the default ZIL located
 within the pool.

Is this understanding correct? Does it apply to any generic writes,
 or only to sync-heavy scenarios like databases or NFS servers?

 Yes, it is correct. It applies to all writes. If the log is allocated on
 a slog device, then the synchronous log records don't fragment the
 pool. As far as I understand it, txgs happen sequentially even with no
 slog device present, but the log entries don't - they are written as
 needed to fulfill each sync write request with minimum latency.
 
 Thanks, I thought similarly, but with the persistent on-list mentions
 (or words that could be interpreted that way) that with SLOG
 devices writes ought to be better coalesced and less fragmented,
 I started getting confused. :)
 
 So, I guess, if the sync-write proportion on a particular system
 is negligible (and that can be measured with dtrace scripts),
 then a slog won't help much with fragmentation of generic async
 writes, right?

Correct.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 12:04 PM, Jim Klimov wrote:
 Probably DDT is also stored with 2 or 3 copies of each block,
 since it is metadata. It was not in the last ZFS on-disk spec
 from 2006 that I found, for some apparent reason ;)

That's probably because it's extremely big (dozens, hundreds or even
thousands of GB).
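
A rough estimate, using the commonly quoted figure of roughly 320 bytes of
in-core state per unique block (treat that constant and the pool sizes
below as assumptions, not measurements), shows how quickly this adds up:

def ddt_size_gib(pool_bytes, avg_block=64 * 1024, bytes_per_entry=320):
    # every unique block needs one DDT entry
    unique_blocks = pool_bytes / avg_block
    return unique_blocks * bytes_per_entry / 2**30

for tib in (10, 50, 200):
    print("%4d TiB pool -> DDT of roughly %6.0f GiB"
          % (tib, ddt_size_gib(tib * 2**40)))
# prints roughly 50, 250 and 1000 GiB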

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
  
 Availability of the DDT is IMHO crucial to a deduped pool, so
 I won't be surprised to see it forced to triple copies. 
 
 IMHO, the more important thing for dedup moving forward is to create an 
 option to dedicate a fast device (SSD or whatever) to the DDT.  So all those 
 little random IO operations never hit the rusty side of the pool.

That's something you can already do with an L2ARC. In the future I plan
on investigating implementing a set of more fine-grained ARC and L2ARC
policy tuning parameters that would give more control into the hands of
admins over how the ARC/L2ARC cache is used.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-08-01 Thread Sašo Kiselkov
On 08/01/2012 04:14 PM, Jim Klimov wrote:
 2012-08-01 17:55, Sašo Kiselkov пишет:
 On 08/01/2012 03:35 PM, opensolarisisdeadlongliveopensolaris wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov

 Availability of the DDT is IMHO crucial to a deduped pool, so
 I won't be surprised to see it forced to triple copies.

 IMHO, the more important thing for dedup moving forward is to create
 an option to dedicate a fast device (SSD or whatever) to the DDT.  So
 all those little random IO operations never hit the rusty side of the
 pool.

 That's something you can already do with an L2ARC. In the future I plan
 on investigating implementing a set of more fine-grained ARC and L2ARC
 policy tuning parameters that would give more control into the hands of
 admins over how the ARC/L2ARC cache is used.
 
 
 Unfortunately, as of current implementations, L2ARC starts up cold.

Yes, that's by design, because the L2ARC is simply a secondary backing
store for ARC blocks. If the memory pointer isn't valid, chances are,
you'll still be able to find the block on the L2ARC devices. You can't
scan an L2ARC device and discover some usable structures, as there
aren't any. It's literally just a big pile of disk blocks and their
associated ARC headers only live in RAM.

 chances are that
 some blocks of userdata might be more popular than a DDT block and
 would push it out of L2ARC as well...

Which is why I plan to investigate implementing some tunable policy
module that would allow the administrator to get around this problem,
e.g. the administrator dedicates 50G of ARC space to metadata (which
includes the DDT) or only the DDT specifically. My idea is still a bit
fuzzy, but it revolves primarily around allocating and policing min and
max quotas for a given ARC entry type. I'll start a separate discussion
thread for this later on once I have everything organized in my mind
about where I plan on taking this.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] number of blocks changes

2012-08-03 Thread Sašo Kiselkov
On 08/03/2012 03:18 PM, Justin Stringfellow wrote:
 While this isn't causing me any problems, I'm curious as to why this is 
 happening...:
 
 
 
 $ dd if=/dev/random of=ob bs=128k count=1  while true

Can you check whether this happens from /dev/urandom as well?

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-06 Thread Sašo Kiselkov
On 08/07/2012 12:12 AM, Christopher George wrote:
 Is your DDRdrive product still supported and moving?
 
 Yes, we now exclusively target ZIL acceleration.
 
 We will be at the upcoming OpenStorage Summit 2012,
 and encourage those attending to stop by our booth and
 say hello :-)
 
 http://www.openstoragesummit.org/
 
 Is it well supported for Illumos?
 
 Yes!  Customers using Illumos derived distros make-up a
 good portion of our customer base.

How come I haven't seen new products coming from you guys? I mean, the
X1 is past 3 years old and some improvements would be sort of expected
in that timeframe. Off the top of my head, I'd welcome things such as:

 *) Increased capacity for high-volume applications.

 *) Remove the requirement to have an external UPS (couple of
supercaps? microbattery?)

 *) Use cheaper MLC flash to lower cost - it's only written to in case
of a power outage anyway, so lower write cycles aren't an issue and
modern MLC is almost as fast as SLC at sequential IO (within 10%
usually).

 *) PCI Express 3.0 interface (perhaps even x4)

 *) Soldered-on DRAM to create a true low-profile card (the current DIMM
slots look like a weird dirty hack).

 *) At least update the benchmarks on your site to compare against modern
flash-based competition (not the Intel X25-E, which is seriously
stone age by now...)

 *) Lower price, lower price, lower price.
I can get 3-4 200GB OCZ Talos-Rs for $2k FFS. That means I could
equip my machine with one to two mirrored slogs and nearly 800GB
worth of L2ARC for the price of a single X1.

I mean this as constructive criticism, not as angry bickering. I totally
respect you guys doing your own thing.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-07 Thread Sašo Kiselkov
On 08/07/2012 02:18 AM, Christopher George wrote:
 I mean this as constructive criticism, not as angry bickering. I totally
 respect you guys doing your own thing.
 
 Thanks, I'll try my best to address your comments...

Thanks for your kind reply, though there are some points I'd like to
address, if that's okay.

 *) Increased capacity for high-volume applications.
 
 We do have a select number of customers striping two
 X1s for a total capacity of 8GB, but for a majority of our customers 4GB
 is perfect.  Increasing capacity
 obviously increases the cost, so we wanted the baseline
 capacity to reflect a solution to most but not every need.

Certainly, for most uses this isn't an issue. I just threw that in
there because, considering how cheap DRAM and flash are nowadays and how
easy it is to create disk pools which push 2GB/s in write throughput, I
was hoping you guys would be keeping pace with that (getting 4GB of sync
writes in the txg commit window can be tough, but not unthinkable). In
any case, simply dismissing it by saying "just get two" effectively
doubles my slog costs which, considering the recommended practice is to
mirror the slog, would mean that I have to get 4 X1's. That's $8k in
list prices and 8 full-height PCI-e slots wasted (seeing as how an X1 is
wider than a standard PCI-e card). Not many systems can do that (that's
why I said solder the DRAM and go low-profile).
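
For reference, the slog sizing arithmetic I have in mind is simply write
rate times txg interval - the slog only has to hold the sync writes that
arrive between two txg commits. The 5-second interval and the rates below
are assumptions for illustration (the interval is tunable and workloads
vary):

txg_interval_s = 5
for write_mb_s in (200, 800, 2000):        # sustained sync-write rates
    in_flight_gb = write_mb_s * txg_interval_s / 1024.0
    print("%5d MB/s of sync writes -> ~%.1f GB outstanding per txg"
          % (write_mb_s, in_flight_gb))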

 *) Remove the requirement to have an external UPS (couple of
supercaps? microbattery?)
 
 Done!  We will be formally introducing an optional DDRdrive
 SuperCap PowerPack at the upcoming OpenStorage Summit.

Great! Though I suppose that will inflate the price even further (seeing
as you used the word "optional").

 *) Use cheaper MLC flash to lower cost - it's only written to in case
of a power outage, anyway so lower write cycles aren't an issue and
modern MLC is almost as fast as SLC at sequential IO (within 10%
usually).
 
 We will be staying with SLC not only for performance but
 longevity/reliability.
 Check out the specifications (ie erase/program cycles and required ECC)
 for a modern 20 nm MLC chip and then let me know if this is where you
 *really* want to cut costs :)

MLC is so much cheaper that you can simply slap on twice as much and use
the rest for ECC, mirroring or simply overprovisioning sectors. The
common practice for extending the lifecycle of MLC is short-stroking
it, i.e. using only a fraction of the capacity. E.g. a 40GB MLC unit
with 5-10k cycles per cell can be turned into a 4GB unit (with the
controller providing wear leveling) with effectively 50-100k cycles
(that's SLC land) for about a hundred bucks. Also, since I'm mirroring
it already with ZFS checksums to provide integrity checking, your
argument simply doesn't hold up.
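
The overprovisioning arithmetic, idealized (it ignores write amplification,
and the cell-endurance numbers are just the rough figures above, not
datasheet values):

raw_gb, exposed_gb = 40, 4
for cell_cycles in (5000, 10000):
    # wear levelling spreads each logical write over raw_gb/exposed_gb cells
    effective = cell_cycles * raw_gb // exposed_gb
    print("%6d cycles/cell, %dGB raw exposed as %dGB -> ~%d effective cycles"
          % (cell_cycles, raw_gb, exposed_gb, effective))
# prints ~50000 and ~100000 effective cycles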

Oh and don't count on Illumos missing support for SCSI Unmap or SATA
TRIM forever. Work is underway to rectify this situation.

 *) PCI Express 3.0 interface (perhaps even x4)
 
 Our product is FPGA based and the PCIe capability is the biggest factor
 in determining component cost.  When we introduced the X1, the FPGA cost
 *alone* to support just PCIe Gen2 x8 was greater than the current street
 price of the DDRdrive X1.

I always had a bit of an issue with non-hotswappable storage systems.
What if an X1 slog dies? I need to power the machine down, open it up,
take out the slog, put in another one and power it back up. Since ZFS
has slog removal support, there's no reason to go for non-hotpluggable
slogs anyway. What about 6G SAS? Dual ported you could push around
12Gbit/s of bandwidth to/from the device, way more than the current
250MB/s, and get hotplug support in there too.

 *) At least updated benchmarks your site to compare against modern
flash-based competition (not the Intel X25-E, which is seriously
stone age by now...)
 
 I completely agree we need to refresh the website, not even the photos
 are representative of our shipping product (we now offer VLP DIMMs).
 We are engineers first and foremost, but an updated website is in the
 works.
 
 In the mean time, we have benchmarked against both the Intel 320/710
 in my OpenStorage Summit 2011 presentation which can be found at:
 
 http://www.ddrdrive.com/zil_rw_revelation.pdf

I always had a bit of an issue with your benchmarks. First off, you're
only ever doing synthetics. They are very nice, but don't provide much
in terms of real-world perspective. Try and compare on price too. Take
something like a Dell R720, stick in the equivalent (in terms of cost!)
of DRAM SSDs and Flash SSDs (i.e. for X1 you're looking at like 4 Intel
710s) and run some real workloads (database benchmarks, virtualization
benchmarks, etc.). Experiment beats theory, every time.

 *) Lower price, lower price, lower price.
I can get 3-4 200GB OCZ Talos-Rs for $2k FFS. That means I could
equip my machine with one to two mirrored slogs and nearly 800GB
worth of L2ARC for the price of a single X1.
 
 I 

Re: [zfs-discuss] what have you been buying for slog and l2arc?

2012-08-07 Thread Sašo Kiselkov
On 08/07/2012 04:08 PM, Bob Friesenhahn wrote:
 On Tue, 7 Aug 2012, Sašo Kiselkov wrote:

 MLC is so much cheaper that you can simply slap on twice as much and use
 the rest for ECC, mirroring or simply overprovisioning sectors. The
 common practice for extending the lifecycle of MLC is short-stroking
 it, i.e. using only a fraction of the capacity. E.g. a 40GB MLC unit
 with 5-10k cycles per cell can be turned into a 4GB unit (with the
 controller providing wear leveling) with effectively 50-100k cycles
 (that's SLC land) for about a hundred bucks. Also, since I'm mirroring
 it already with ZFS checksums to provide integrity checking, your
 argument simply doesn't hold up.
 
 Remember he also said that the current product is based principally on
 an FPGA.  This FPGA must be interfacing directly with the Flash device
 so it would need to be substantially redesigned to deal with MLC Flash
 (probably at least an order of magnitude more complex), or else a
 microcontroller would need to be added to the design, and firmware would
 handle the substantial complexities.  If the Flash device writes slower,
 then the power has to stay up longer.  If the Flash device reads slower,
 then it takes longer for the drive to come back on line.

Yeah, I know, but then, you can interface with an existing
industry-standard flash controller, no need to design your own (reinvent
the wheel). The choice of FPGA is good for some things, but flexibility
in exchanging components certainly isn't one of them.

If I were designing something akin to the X1, I'd go with a generic
embedded CPU design (e.g. a PowerPC core) interfacing with standard
flash components and running the primary front-end from the chip's
on-board DRAM. I mean, just to give you some perspective, for $2k I
could build a full computer with 8GB of mirrored ECC DRAM which
interfaces via an off-the-shelf 6G SAS HBA (with two 4x wide 6G SAS
ports) or perhaps even an InfiniBand adapter with RDMA with the host
machine, includes a small SSD in its SATA bay and a tiny UPS battery to
run the whole thing for a few minutes while we write DRAM contents to
flash in case of a power outage (the current X1 doesn't even include
this in its base design). And that's something I could do with
off-the-shelf components for less than $2k (probably a whole lot less)
with a production volume of _1_.

 Quite a lot of product would need to be sold in order to pay for both
 re-engineering and the cost of running a business.

Sure, that's why I said it's David vs. Goliath. However, let's be honest
here, the X1 isn't a terribly complex product. It's quite literally a
tiny computer with some DRAM and a feature to dump DRAM contents to
Flash (and read it back later) in case power fails. That's it.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FreeBSD ZFS

2012-08-09 Thread Sašo Kiselkov
On 08/09/2012 12:52 PM, Joerg Schilling wrote:
 Jim Klimov jimkli...@cos.ru wrote:
 
 In the end, the open-sourced ZFS community got no public replies
 from Oracle regarding collaboration or lack thereof, and decided
 to part ways and implement things independently from Oracle.
 AFAIK main ZFS development converges in illumos-gate, contributed
 to by some OpenSolaris-derived distros and being the upstream for
 FreeBSD port of ZFS (probably others too).
 
 To me it seems that the open-sourced ZFS community is not open, or could 
 you 
 point me to their mailing list archives?
 
 Jörg
 

z...@lists.illumos.org

Welcome.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FreeBSD ZFS

2012-08-09 Thread Sašo Kiselkov
On 08/09/2012 01:05 PM, Joerg Schilling wrote:
 Sašo Kiselkov skiselkov...@gmail.com wrote:
 
 To me it seems that the open-sourced ZFS community is not open, or could 
 you 
 point me to their mailing list archives?

 Jörg


 z...@lists.illumos.org
 
 Well, why then has there been a discussion about a closed zfs mailing list?
 Is this no longer true?

Not that I know of. The above one is where I post my changes and Matt,
George, Garrett and all the others are lurking there.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FreeBSD ZFS

2012-08-09 Thread Sašo Kiselkov
On 08/09/2012 01:11 PM, Joerg Schilling wrote:
 Sašo Kiselkov skiselkov...@gmail.com wrote:
 
 On 08/09/2012 01:05 PM, Joerg Schilling wrote:
 Sašo Kiselkov skiselkov...@gmail.com wrote:

 To me it seems that the open-sourced ZFS community is not open, or 
 could you 
 point me to their mailing list archives?

 Jörg


 z...@lists.illumos.org

 Well, why then has there been a discussion about a closed zfs mailing 
 list?
 Is this no longer true?

 Not that I know of. The above one is where I post my changes and Matt,
 George, Garrett and all the others are lurking there.
 
 So if you frequently read this list, can you tell me whether they discuss the 
 on-disk format in this list?

It's more of a list for development discussion and integration of
changes, not a list for general ZFS discussion like zfs-discuss is.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

