Re: [zfs-discuss] Solaris derivate with the best long-term future

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 01:51 PM, Eugen Leitl wrote:
 
 As a napp-it user who needs to upgrade from NexentaCore, I recently saw
 "preferred for OpenIndiana (live), but running under Illumian, NexentaCore and
 Solaris 11 (Express)"
 as a system recommendation for napp-it.
 
 I wonder about the future of OpenIndiana and Illumian, which
 fork is likely to see the most continued development, in your opinion?

I use OpenIndiana personally, since it's the one I'm most familiar with
(direct continuation of OpenSolaris tradition). If you need something
with commercial support in that spirit, I recommend having a look at
OmniOS. Joyent's SmartOS is really interesting, albeit a bit
narrow-profile for my taste (plus, its use of NetBSD packaging means
I'll have to adapt to a new way of doing things and I like IPS very much).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:39 PM, David Magda wrote:
 On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote:
 However, before I start out on a pointless endeavor, I wanted to probe
 the field of ZFS users, especially those using dedup, on whether their
 workloads would benefit from a faster hash algorithm (and hence, lower
 CPU utilization). Developments of late have suggested to me three
 possible candidates:
 [...]
 
 I'd wait until SHA-3 is announced. It's supposed to happen this year, of
 which only six months are left:
 
 http://csrc.nist.gov/groups/ST/hash/timeline.html
 http://en.wikipedia.org/wiki/NIST_hash_function_competition
 
 It was actually supposed to happen in 2Q, so they're running a little
 late, it seems.

I'm not convinced waiting makes much sense. The SHA-3 standardization
process' goals are different from ours. SHA-3 can choose to go with
something that's slower, but has a higher security margin. I think that
absolute super-tight security isn't all that necessary for ZFS, since
the hash isn't used for security purposes. We only need something that's
fast and has a good pseudo-random output distribution. That's why I
looked toward Edon-R. Even though it might have security problems in
itself, it's by far the fastest algorithm in the entire competition.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:57 PM, Gregg Wonderly wrote:
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?

Do you think that, if you can come up with this procedure, the crypto
security guys at universities haven't thought about it as well? Of course
they have. No, simply generating a sequence of random patterns and
hoping to hit a match won't do the trick.

P.S. I really don't mean to sound smug or anything, but I know one thing
for sure: the crypto researchers who propose these algorithms are some
of the brightest minds on this topic on the planet, so I would hardly
think they didn't consider trivial problems.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 03:58 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Sašo Kiselkov

 I really mean no disrespect, but this comment is so dumb I could swear
 my IQ dropped by a few tenths of a point just by reading.
 
 Cool it please.  You say "I mean no disrespect" and then say something
 which is clearly disrespectful.

I sort of flew off the handle there, and I shouldn't have. It felt like
Tomas was misrepresenting my position and putting words in my mouth I
didn't say. I certainly didn't mean to diminish the validity of an
honest question.

 Tomas's point is to illustrate that hashing is a many-to-one function.  If
 it were possible to rely on the hash to always be unique, then you could use
 it as a compression algorithm.  He's pointing out that's insane.  His
 comment was not in the slightest bit dumb; if anything, it seems like maybe
 somebody (or some people) didn't get his point.

I understood his point very well and I never argued that hashing always
results in unique hash values, which is why I thought he was
misrepresenting what I said.

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)

A formal proof would of course take longer to construct and would take
time that I feel is best spent writing code.
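To make point 2 concrete, here's a quick pigeonhole count (a rough
illustration of mine, assuming the default 128K recordsize):

  # How many distinct 128K blocks map, on average, to each 256-bit digest?
  block_bits = 128 * 1024 * 8        # bits in a 128K record
  hash_bits = 256                    # bits in a SHA-256 digest

  print("blocks per digest: 2^%d" % (block_bits - hash_bits))
  # -> 2^1048320 distinct inputs share each digest on average, so the digest
  #    alone cannot tell you which block it came from.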

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:19 PM, Gregg Wonderly wrote:
 But this is precisely the kind of observation that some people seem to miss 
 out on the importance of.  As Tomas suggested in his post, if this was true, 
 then we could have a huge compression ratio as well.  And even if there was 
 10% of the bit patterns that created non-unique hashes, you could use the 
 fact that a block hashed to a known bit pattern that didn't have collisions, 
 to compress the other 90% of your data.
 
 I'm serious about this from a number of perspectives.  We worry about the 
 time it would take to reverse SHA or RSA hashes to passwords, not even 
 thinking that what if someone has been quietly computing all possible hashes 
 for the past 10-20 years into a database some where, with every 5-16 
 character password, and now has an instantly searchable hash-to-password 
 database.

This is something very well known in the security community as rainbow
tables and a common method to protect against it is via salting. Never
use a password hashing scheme which doesn't use salts for exactly the
reason you outlined above.
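For anyone unfamiliar with the idea, a minimal sketch of salted password
hashing (my own illustration in Python, not anything ZFS-related):

  import hashlib, hmac, os

  def hash_password(password):
      # A fresh random salt per password defeats precomputed (rainbow) tables:
      # the attacker would need a separate table for every salt value.
      salt = os.urandom(16)
      digest = hashlib.pbkdf2_hmac('sha256', password, salt, 100000)
      return salt, digest

  def verify_password(password, salt, digest):
      candidate = hashlib.pbkdf2_hmac('sha256', password, salt, 100000)
      return hmac.compare_digest(candidate, digest)  # constant-time comparison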

 Sometimes we ignore the scale of time, thinking that only the immediately 
 visible details are what we have to work with.
 
 If no one has computed the hashes for every single 4K and 8K block, then 
 fine.  But, if that was done, and we had that data, we'd know for sure which 
 algorithm was going to work the best for the number of bits we are 
 considering.

Do you even realize how many 4K or 8K blocks there are?!?! Exactly
2^32768 or 2^65536 respectively. I wouldn't worry about somebody having
those pre-hashed ;-) Rainbow tables only work for a very limited subset
of data.
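Just to put those figures into perspective, a quick illustrative
calculation:

  # Count of distinct 4K and 8K blocks, as powers of 2 and (roughly) of 10.
  for size_bytes in (4096, 8192):
      bits = size_bytes * 8
      digits = len(str(2 ** bits)) - 1
      print("%dK blocks: 2^%d (about 10^%d)" % (size_bytes // 1024, bits, digits))
  # -> 4K: 2^32768 (about 10^9864); 8K: 2^65536 (about 10^19728)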

 Speculating based on the theory of the algorithms for random number of bits 
 is just silly.  Where's the real data that tells us what we need to know?

If you don't trust math, then I there's little I can do to convince you.
But remember our conversation the next time you step into a car or get
on an airplane. The odds that you'll die on that ride are far higher
than that you'll find a random hash collision in a 256-bit hash algorithm...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:22 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 If an algorithm is not 'secure' and zfs is not set to verify, doesn't
 that mean that a knowledgeable user will be able to cause intentional
 data corruption if deduplication is enabled?  A user with very little
 privilege might be able to cause intentional harm by writing the magic
 data block before some other known block (which produces the same hash)
 is written.  This allows one block to substitute for another.
 
 It does seem that security is important because with a human element,
 data is not necessarily random.

Theoretically yes, it is possible, but the practicality of such an
attack is very much in doubt. In case this is a concern, however, one
can always switch to a more secure hash function (e.g. Skein-512).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:23 PM, casper@oracle.com wrote:
 
 On Tue, 10 Jul 2012, Edward Ned Harvey wrote:

 CPU's are not getting much faster.  But IO is definitely getting faster.
 It's best to keep ahead of that curve.

 It seems that per-socket CPU performance is doubling every year. 
 That seems like faster to me.
 
 I think that I/O isn't getting as fast as CPU is; memory capacity and
 bandwidth and CPUs are getting faster.  I/O, not so much.
 (Apart from the one single step from harddisk to SSD; but note that
 I/O is limited to standard interfaces and as such it is likely to be
 held down by requiring a new standard.)

Have you seen one of those SSDs made by FusionIO? Those things fit in a
single PCI-e x8 slot and can easily push a sustained rate upward of
several GB/s. Do not expect that drives are the be-all and end-all of
storage. Hybrid storage invalidated the traditional "CPU & memory fast,
disks slow" wisdom years ago.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:27 PM, Gregg Wonderly wrote:
 Unfortunately, the government imagines that people are using their home 
 computers to compute hashes and try and decrypt stuff.  Look at what is 
 happening with GPUs these days.  People are hooking up 4 GPUs in their 
 computers and getting huge performance gains.  5-6 char password space 
 covered in a few days.  12 or so chars would take one machine a couple of 
 years if I recall.  So, if we had 20 people with that class of machine, we'd 
 be down to a few months.   I'm just suggesting that while the compute space 
 is still huge, it's not actually undoable, it just requires some thought into 
 how to approach the problem, and then some time to do the computations.
 
 Huge space, but still finite…

There are certain physical limits which one cannot exceed. For instance,
you cannot store 2^256 units of 32-byte quantities on Earth. Even if you
used proton spin (or some other quantum property) to store a bit, there
simply aren't enough protons in the entire visible universe to do it.
You will never ever be able to search a 256-bit memory space using a
simple exhaustive search. The reason why our security hashes are so long
(256-bits, 512-bits, more...) is because attackers *don't* do an
exhaustive search.
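For scale, a rough comparison (my own figures; the atom count is only a
commonly cited order-of-magnitude estimate):

  # The 256-bit value space vs. the approximate number of atoms in Earth.
  hash_space = 2 ** 256
  earth_atoms = 10 ** 50                                 # rough estimate

  print("2^256 ~= %.2e" % hash_space)                    # ~1.16e77
  print("ratio ~= %.0e" % (hash_space / earth_atoms))    # ~1e27
  # Even storing one value per atom, Earth falls short by ~27 orders of
  # magnitude.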

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:30 PM, Gregg Wonderly wrote:
 This is exactly the issue for me.  It's vital to always have verify on.  If
 you don't have the data to prove that every possible block combination
 hashes uniquely for the small bit space we are talking about, then how in
 the world can you say that verify is not necessary?  That just seems
 ridiculous to propose.

Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated the ability to produce a hash collision in any
suitably long hash (128 bits and up) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:36 PM, Justin Stringfellow wrote:
 
 
 Since there is a finite number of bit patterns per block, have you tried to 
 just calculate the SHA-256 or SHA-512 for every possible bit pattern to see 
 if there is ever a collision?  If you found an algorithm that produced no 
 collisions for any possible block bit pattern, wouldn't that be the win?
  
 Perhaps I've missed something, but if there was *never* a collision, you'd 
 have stumbled across a rather impressive lossless compression algorithm. I'm 
 pretty sure there's some Big Mathematical Rules (Shannon?) that mean this 
 cannot be.

Do you realize how big your lookup dictionary would have to be?

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote:
 As I said several times before, to produce hash collisions. Or to calculate
 rainbow tables (as a previous user theorized it) you only need the
 following.
 
 You don't need to reproduce all possible blocks.
 1. SHA256 produces a 256 bit hash
 2. That means it produces a value on 256 bits, in other words a value
 between 0..2^256 - 1
 3. If you start counting from 0 to 2^256 and for each number calculate the
 SHA256 you will get at least one hash collision (if the hash algortihm is
 prefectly distributed)
 4. Counting from 0 to 2^256, is nothing else but reproducing all possible
 bit pattern on 32 bytes
 
 It's not about whether one computer is capable of producing the above
 hashes or not, or whether there are actually that many unique 32 byte bit
 patterns in the universe.
 A collision can happen.

It's actually not that simple, because in hash collision attacks you're
not always afforded the luxury of being able to define your input block.
More often than not, you want to modify a previously hashed block in
such a fashion that it carries your intended modifications while hashing
to the same original value. Say for instance you want to modify a
512-byte message (e.g. an SSL certificate) to point to your own CN. Your
rainbow table, even if you could store it somewhere (you couldn't, btw),
would do you little good here.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:54 PM, Ferenc-Levente Juhos wrote:
 You don't have to store all hash values:
 a. Just memorize the first one, SHA256(0)
 b. start counting
 c. bang: by the time you get to 2^256 you get at least one collision.

Just one question: how long do you expect this is going to take on
average? Come on, do the math!
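For reference, the back-of-the-envelope version of that math (my own
numbers; the hash rate is an assumed, generously fast attacker):

  # How long would counting through the hash space take?
  SECONDS_PER_YEAR = 3600 * 24 * 365
  rate = 10 ** 12                                    # hashes/second (assumed)

  exhaustive = 2 ** 256 / rate / SECONDS_PER_YEAR    # enumerate the full space
  birthday = 2 ** 128 / rate / SECONDS_PER_YEAR      # ~sqrt(N) to a collision

  print("exhaustive search: ~%.1e years" % exhaustive)   # ~3.7e57 years
  print("birthday bound:    ~%.1e years" % birthday)     # ~1.1e19 years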

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 04:56 PM, Gregg Wonderly wrote:
 So, if I had a block collision on my ZFS pool that used dedup, and it had my 
 bank balance of $3,212.20 on it, and you tried to write your bank balance of 
 $3,292,218.84 and got the same hash, no verify, and thus you got my 
 block/balance and now your bank balance was reduced by 3 orders of magnitude, 
 would you be okay with that?  What assurances would you be content with using 
 my ZFS pool?

I'd feel entirely safe. There, I said it.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:10 PM, David Magda wrote:
 On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote:
 
 I'm not convinced waiting makes much sense. The SHA-3 standardization
 process' goals are different from ours. SHA-3 can choose to go with
 something that's slower, but has a higher security margin. I think that
 absolute super-tight security isn't all that necessary for ZFS, since
 the hash isn't used for security purposes. We only need something that's
 fast and has a good pseudo-random output distribution. That's why I
 looked toward Edon-R. Even though it might have security problems in
 itself, it's by far the fastest algorithm in the entire competition.
 
 Fair enough, though I think eventually the SHA-3 winner will be
 incorporated into hardware (or at least certain instructions used in the
 algorithm will). I think waiting a few more weeks/months shouldn't be a
 big deal, as the winner should be announced Real Soon Now, and then a more
 informed decision can probably be made.

The AES process winner was announced in October 2000. Considering that
AES-NI was proposed in March 2008 and the first silicon for it appeared
around January 2010, I wouldn't hold my breath hoping for widespread
SHA-3-specific hardware acceleration for at least another 5-10 years
(around 2-3 technology generations).

That being said, a lot can be achieved using SIMD instructions, but that
doesn't depend on the SHA-3 process in any way.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:33 PM, Bob Friesenhahn wrote:
 On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

 The reason why I don't think this can be used to implement a practical
 attack is that in order to generate a collision, you first have to know
 the disk block that you want to create a collision on (or at least the
 checksum), i.e. the original block is already in the pool. At that
 point, you could write a colliding block which would get de-dup'd, but
 that doesn't mean you've corrupted the original data, only that you
 referenced it. So, in a sense, you haven't corrupted the original block,
 only your own collision block (since that's the copy that doesn't get
 written).
 
 This is not correct.  If you know the well-known block to be written,
 then you can arrange to write your collision block prior to when the
 well-known block is written.  Therefore, it is imperative that the hash
 algorithm make it clearly impractical to take a well-known block and
 compute a collision block.
 
 For example, the well-known block might be part of a Windows anti-virus
 package, or a Windows firewall configuration, and corrupting it might
 leave a Windows VM open to malware attack.

True, but that may not be enough to produce a practical collision, for
the reason that while you know which bytes you want to attack, these
might not line up with ZFS disk blocks (especially the case with Windows
VMs, which are stored in large opaque zvols) - such an attack would
require physical access to the machine (at which point you can simply
manipulate the blocks directly).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 05:58 PM, Gregg Wonderly wrote:
 You're entirely sure that there could never be two different blocks that can 
 hash to the same value and have different content?
 
 Wow, can you just send me the cash now and we'll call it even?

You're the one making the positive claim and I'm calling bullshit. So
the onus is on you to demonstrate the collision (and that you arrived at
it via your brute force method as described). Until then, my money stays
safely in my bank account. Put up or shut up, as the old saying goes.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 06:23 PM, Gregg Wonderly wrote:
 What I'm saying is that I am getting conflicting information from your 
 rebuttals here.

Well, let's address that then:

 I (and others) say there will be collisions that will cause data loss if 
 verify is off.

Saying that "there will be" without any supporting evidence to back it
up amounts to a prophecy.

 You say it would be so rare as to be impossible from your perspective.

Correct.

 Tomas says, well then let's just use the hash value for a 4096X compression.
 You fluff around his argument calling him names.

Tomas' argument was, as I understood later, an attempt at sarcasm.
Nevertheless, I later explained exactly why I consider the
hash-compression claim total and utter bunk:

So for a full explanation of why hashes aren't usable for compression:

 1) they are one-way (kind of bummer for decompression)
 2) they operate far below the Shannon limit (i.e. unusable for
lossless compression)
 3) their output is pseudo-random, so even if we find collisions, we
have no way to distinguish which input was the most likely one meant
for a given hash value (all are equally probable)

 I say, well then compute all the possible hashes for all possible bit 
 patterns and demonstrate no dupes.

This assumes it's possible to do so. Ferenc made a similar claim and I
responded with this question: "how long do you expect this is going to
take on average? Come on, do the math!". I pose the same to you. Find the
answer and you'll understand exactly why what you're proposing is
impossible.

 You say it's not possible to do that.

Please go on and compute a reduced size of the problem for, say, 2^64
32-byte values (still a laughably small space for the problem, but I'm
feeling generous). Here's the amount of storage you'll need:

2^64 * 32 bytes = 2^69 bytes = 512 exbibytes (524,288 pebibytes)

And that's for a problem that I've reduced for you by a factor of 2^192.
You see, only when you do the math do you realize how far off base you
are in claiming that pre-computation of hash rainbow tables for generic
bit patterns is doable.
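(A quick check of that storage figure, for the record:)

  # Storage needed to tabulate 2^64 distinct 32-byte values.
  entries = 2 ** 64
  total = entries * 32                               # 2^69 bytes
  print("%d bytes" % total)                          # 590295810358705651712
  print("= %.0f EiB" % (total / 2 ** 60))            # 512 EiB
  print("= %.0f EB (decimal)" % (total / 10 ** 18))  # ~590 EB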

 I illustrate a way that loss of data could cost you money.

That's merely an emotional argument where you are trying to attack me by
invoking an emotional response from "when my ass is on the line". Sorry,
that doesn't invalidate the original argument that you can't do rainbow
table pre-computation for long bit patterns.

 You say it's impossible for there to be a chance of me constructing a block 
 that has the same hash but different content.

To make sure we're not using ambiguous rhetoric here, allow me to
summarize my position: you cannot produce, in practical terms, a hash
collision on a 256-bit secure hash algorithm using a brute-force algorithm.

 Several people have illustrated that 128K to 32bits is a huge and lossy ratio 
 of compression, yet you still say it's viable to leave verify off.

Except that we're not talking 128K to 32b, but 128K to 256b. Also, only
once you appreciate the mathematics behind the size of the 256-bit
pattern space can you understand why leaving verify off is okay.

 I say, in fact that the total number of unique patterns that can exist on any 
 pool is small, compared to the total, illustrating that I understand how the 
 key space for the algorithm is small when looking at a ZFS pool, and thus 
 could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took
courses on cryptography and/or information theory. The size of your
storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even
if your pool were the size of a single block, we're talking here about
the *mathematical* possibility of hitting on a random block that hashes
to the same value. Given a stream of random data blocks (thus simulating
an exhaustive brute-force search) and a secure pseudo-random hash
function (which has a roughly equal chance of producing any output value
for a given input block), you've got only a 10^-77 chance of getting a
hash collision. If you don't understand how this works, read a book on
digital coding theory.

 So I can see what perspective you are drawing your confidence from, but I, 
 and others, are not confident that the risk has zero probability.

I never said the risk is zero. The risk is non-zero, but it is so close
to zero that you may safely ignore it (since we take much greater risks
on a daily basis without so much as a blink of an eye).

 I'm pushing you to find a way to demonstrate that there is zero risk because 
 if you do that, then you've, in fact created the ultimate compression factor 
 (but enlarged the keys that could collide because the pool is now virtually 
 larger), to date for random bit patterns, and you've also demonstrated that 
 the particular algorithm is very good for dedup. 
 That would indicate to me, that you can then take that algorithm, and run it 
 inside of ZFS dedup to automatically manage when verify is necessary by 
 detecting when a collision occurs.

Do 

Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Sašo Kiselkov
On 07/11/2012 10:06 PM, Bill Sommerfeld wrote:
 On 07/11/12 02:10, Sašo Kiselkov wrote:
 Oh jeez, I can't remember how many times this flame war has been going
 on on this list. Here's the gist: SHA-256 (or any good hash) produces a
 near uniform random distribution of output. Thus, the chances of getting
 a random hash collision are around 2^-256 or around 10^-77.
 
 I think you're correct that most users don't need to worry about this --
 sha-256 dedup without verification is not going to cause trouble for them.
 
 But your analysis is off.  You're citing the chance that two blocks picked at
 random will have the same hash.  But that's not what dedup does; it compares
 the hash of a new block to a possibly-large population of other hashes, and
 that gets you into the realm of birthday problem or birthday paradox.
 
 See http://en.wikipedia.org/wiki/Birthday_problem for formulas.
 
 So, maybe somewhere between 10^-50 and 10^-55 for there being at least one
 collision in really large collections of data - still not likely enough to
 worry about.

Yeah, I know, I did this as a quick first-degree approximation. However,
even that range is still far smaller than the chance of getting a random
bit-rot error that even Fletcher won't catch.
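To put a number on that, a quick birthday-bound approximation (my own
figures; it assumes an ideal hash and, purely for the sake of the example,
a 32 PiB pool of unique 128K records):

  import math

  # P(at least one SHA-256 collision) across a large pool, birthday bound.
  record_size = 128 * 1024               # bytes per block (assumed)
  pool_bytes = 32 * 2 ** 50              # 32 PiB (assumed)
  n = pool_bytes // record_size          # 2^38 unique blocks

  p = n * (n - 1) / (2.0 * 2 ** 256)     # ~n^2 / 2^257 for small p
  print("blocks: 2^%d, P(collision) ~= %.1e" % (round(math.log2(n)), p))
  # -> about 3e-55, consistent with the 1e-50..1e-55 range quoted above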

 Of course, that assumption goes out the window if you're concerned that an
 adversary may develop practical ways to find collisions in sha-256 within the
 deployment lifetime of a system.  sha-256 is, more or less, a scaled-up sha-1,
 and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from
 2^160 bits of hash; the best credible attack is somewhere around 2^57.5 (see
 http://en.wikipedia.org/wiki/SHA-1#SHA-1).

Of course, this is theoretically possible, however, I do not expect such
an attack to be practical within any reasonable time frame of the
deployment. In any case, should a realistic need to solve this arise, we
can always simply switch hashes (I'm also planning to implement
Skein-512/256) and do a recv/send to rewrite everything on disk. PITA?
Yes. Serious problem? Don't think so.

 on a somewhat less serious note, perhaps zfs dedup should contain chinese
 lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation)
 which asks the sysadmin to report a detected sha-256 collision to
 eprint.iacr.org or the like...

How about we ask them to report to me instead, like so:

1) Detect collision
2) Report to me
3) ???
4) Profit!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] New fast hash algorithm - is it needed?

2012-07-10 Thread Sašo Kiselkov
Hi guys,

I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS
implementation to supplant the currently utilized sha256. On modern
64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much
slower than many of the SHA-3 candidates, so I went out and did some
testing (details attached) on a possible new hash algorithm that might
improve on this situation.

However, before I start out on a pointless endeavor, I wanted to probe
the field of ZFS users, especially those using dedup, on whether their
workloads would benefit from a faster hash algorithm (and hence, lower
CPU utilization). Developments of late have suggested to me three
possible candidates:

 * SHA-512: simplest to implement (since the code is already in the
   kernel) and provides a modest performance boost of around 60%.

 * Skein-512: overall fastest of the SHA-3 finalists and much faster
   than SHA-512 (around 120-150% faster than the current sha256).

 * Edon-R-512: probably the fastest general purpose hash algorithm I've
   ever seen (upward of a 300% speedup over sha256), but might have
   potential security problems (though I don't think this is of any
   relevance to ZFS, as it doesn't use the hash for any kind of security
   purposes, but only for data integrity & dedup).

My testing procedure: nothing sophisticated, I took the implementation
of sha256 from the Illumos kernel and simply ran it on a dedicated
psrset (where possible with a whole CPU dedicated, even if only to a
single thread) - I tested both the generic C implementation and the
Intel assembly implementation. The Skein and Edon-R implementations are
in C optimized for 64-bit architectures from the respective authors (the
most up to date versions I could find). All code has been compiled using
GCC 3.4.3 from the repos (the same that can be used for building
Illumos). Sadly, I don't have access to Sun Studio.
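For anyone who wants a very rough user-space comparison without the kernel
bits, here's a minimal sketch along the same lines using Python's hashlib
(my own illustration; it measures wall-clock throughput rather than
cycles/byte and obviously won't match the kernel numbers below):

  import hashlib, os, time

  def throughput_mb_s(algo, data, rounds=16):
      # Hash the same buffer repeatedly and report rough wall-clock MB/s.
      start = time.perf_counter()
      for _ in range(rounds):
          hashlib.new(algo, data).digest()
      elapsed = time.perf_counter() - start
      return len(data) * rounds / elapsed / 1e6

  buf = os.urandom(64 * 1024 * 1024)      # 64 MB of random data
  for algo in ("sha256", "sha512"):
      print("%-8s %8.1f MB/s" % (algo, throughput_mb_s(algo, buf)))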

Cheers,
--
Saso
Hash performances on 10 GB of data
gcc (GCC) 3.4.3 (csl-sol210-3_4-20050802)
CFLAGS: -O3 -fomit-frame-pointer -m64

MACHINE #1
CPU: dual AMD Opteron 4234
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     21.19 cycles/byte   (baseline)
sha256 (C)       27.66 cycles/byte   -23.34%

sha512 (ASM)     13.48 cycles/byte   57.20%
sha512 (C)       17.35 cycles/byte   22.13%

Skein-512 (C)     8.95 cycles/byte   136.76%
Edon-R-512 (C)    4.94 cycles/byte   328.94%

MACHINE #2
CPU: single AMD Athlon II Neo N36L
Options: single thread on no-intr 1-core psrset

Algorithm        Result              Improvement
sha256 (ASM)     15.68 cycles/byte   (baseline)
sha256 (C)       18.81 cycles/byte   -16.64%

sha512 (ASM)      9.95 cycles/byte   57.59%
sha512 (C)       11.84 cycles/byte   32.43%

Skein-512 (C)     6.25 cycles/byte   150.88%
Edon-R-512 (C)    3.66 cycles/byte   328.42%

MACHINE #3
CPU: dual Intel Xeon E5645
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     15.49 cycles/byte   (baseline)
sha256 (C)       17.90 cycles/byte   -13.46%

sha512 (ASM)      9.88 cycles/byte   56.78%
sha512 (C)       11.44 cycles/byte   35.40%

Skein-512 (C)     6.88 cycles/byte   125.15%
Edon-R-512 (C)    3.35 cycles/byte   362.39%

MACHINE #4
CPU: single Intel Xeon E5405
Options: single thread on no-intr 1-core psrset

Algorithm        Result              Improvement
sha256 (ASM)     17.45 cycles/byte   (baseline)
sha256 (C)       18.34 cycles/byte   -4.85%

sha512 (ASM)     10.24 cycles/byte   70.41%
sha512 (C)       11.72 cycles/byte   48.90%

Skein-512 (C)     7.32 cycles/byte   138.39%
Edon-R-512 (C)    3.86 cycles/byte   352.07%

MACHINE #5
CPU: dual Intel Xeon E5450
Options: single thread on no-intr whole-CPU psrset

Algorithm        Result              Improvement
sha256 (ASM)     16.43 cycles/byte   (baseline)
sha256 (C)       18.50 cycles/byte   -11.19%

sha512 (ASM)     10.37 cycles/byte   58.44%
sha512 (C)       11.85 cycles/byte   38.65%

Skein-512 (C)     7.38 cycles/byte   122.63%
Edon-R-512 (C)    3.88 cycles/byte   323.45%

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-19 Thread Sašo Kiselkov
On 06/19/2012 11:05 AM, Sašo Kiselkov wrote:
 On 06/18/2012 07:50 PM, Roch wrote:

 Are we hitting :
  7167903 Configuring VLANs results in single threaded soft ring fanout
 
 Confirmed, it is definitely this.

Hold the phone, I just tried unconfiguring all of the VLANs in the
system and went to pure interfaces and it didn't help. So while the
issue stems from the soft ring fanout, it's probably not caused by
VLANs. Thanks for the pointers anyway, though.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-18 Thread Sašo Kiselkov
On 06/18/2012 12:05 AM, Richard Elling wrote:
 You might try some of the troubleshooting techniques described in Chapter 5
 of the DTrace book by Brendan Gregg and Jim Mauro. It is not clear from your
 description that you are seeing the same symptoms, but the technique should
 apply.
  -- richard

Thanks for the advice, I'll try it. In the mean time, I'm beginning to
suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at

# mdb -k
::interrupts
IRQ  Vect IPL BusTrg Type   CPU Share APIC/INT# ISR(s)
.[snip]
91   0x82 7   PCIEdg MSI5   1 - pcieb_intr_handler
.[snip].

In mpstat I can see that during normal operation, CPU 5 is nearly floored:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  50   00   5120 105400  8700 00  93   0   7

Then, when anything hits which disturbs the PCI-e bus (e.g. a txg flush
or the xcall storm), the CPU goes to 100% utilization and my networking
throughput drops accordingly. The issue can be softened by lowering the
input bandwidth from ~46MB/s to below 20MB/s - at that point I'm getting
only about 10% utilization on the core in question and no xcall storm or
txg flush can influence my network (though I do see the CPU get about
70% busy during the process, but still enough left to avoid packet loss).

So it seems, I'm hitting some hardware design issue, or something...
I'll try moving my network card to the second PCI-e I/O bridge tomorrow
(which seems to be bound to CPU 6).

Any other ideas on what I might try to get the PCI-e I/O bridge
bandwidth back? Or how to fight the starvation of the CPU by other
activities in the system? (xcalls and/or txg flushes) I already tried
putting the CPUs in question into an empty processor set, but that isn't
enough, it seems.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-17 Thread Sašo Kiselkov
On 06/13/2012 03:43 PM, Roch wrote:
 
 Sašo Kiselkov writes:
   On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:

So the xcall are necessary part of memory reclaiming, when one needs to 
 tear down the TLB entry mapping the physical memory (which can from here on 
 be repurposed).
So the xcall are just part of this. Should not cause trouble, but they 
 do. They consume a cpu for some time.

That in turn can cause infrequent latency bubble on the network. A 
 certain root cause of these latency bubble is that network thread are bound 
 by default and
if the xcall storm ends up on the CPU that the network thread is bound 
 to, it will wait for the storm to pass.
   
   I understand, but the xcall storm only eats up a single core out
   of a total of 32, plus it's not a single specific one, it tends to
   change, so what are the odds of hitting the same core as the one on
   which the mac thread is running?
   
 
 That's easy :-) : 1/32 each time it needs to run. So depending on how often 
 it runs (which depends on how
 much churn there is in the ARC) and how often you see the latency bubbles, 
 that may or may
 not be it.
 
 What is zio_taskq_batch_pct on your system ? That is another storm bit of 
 code which
 causes bubble. Setting it down to 50 (versus an older default of 100) should 
 help if it's
 not done already.
 
 -r

So I tried all of the suggestions above (mac unbinding, zio_taskq
tuning) and none helped. I'm beginning to suspect it has something to do
with the networking cards. When I try and snoop filtered traffic from
one interface into a file (snoop -o /tmp/dump -rd vlan935 host
a.b.c.d), my multicast reception throughput plummets to about 1/3 of
the original.

I'm running a link-aggregation of 4 on-board Broadcom NICs:

# dladm show-aggr -x
LINK     PORT   SPEED   DUPLEX  STATE  ADDRESS            PORTSTATE
aggr0    --     1000Mb  full    up     d0:67:e5:fc:bd:38  --
         bnx1   1000Mb  full    up     d0:67:e5:fc:bd:38  attached
         bnx2   1000Mb  full    up     d0:67:e5:fc:bd:3a  attached
         bnx3   1000Mb  full    up     d0:67:e5:fc:bd:3c  attached
         bnx0   1000Mb  full    up     d0:67:e5:fc:bd:36  attached

# dladm show-vlan
LINK      VID   OVER    FLAGS
vlan49    49    aggr0   -
vlan934   934   aggr0   -
vlan935   935   aggr0   -

Normally, I'm getting around 46MB/s on vlan935, however, once I run any
snoop command which puts the network interfaces into promisc mode, my
throughput plummets to around 20MB/s. During that I can see context
switches skyrocket on 4 CPU cores and them being around 75% busy. Now I
understand that snoop has some probe effect, but this is definitely too
large. I've never seen this kind of bad behavior before on any of my
other Solaris systems (with similar load).

Are there any tunings I can make to my network to track down the issue?
My module for bnx is:

# modinfo | grep bnx
169 f80a7000  63ba0 197   1  bnx (Broadcom NXII GbE 6.0.1)

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks

2012-06-15 Thread Sašo Kiselkov
On 06/15/2012 03:35 PM, Johannes Totz wrote:
 On 15/06/2012 13:22, Sašo Kiselkov wrote:
 On 06/15/2012 02:14 PM, Hans J Albertsson wrote:
 I've got my root pool on a mirror on 2 512 byte blocksize disks. I
 want to move the root pool to two 2 TB disks with 4k blocks. The
 server only has room for two disks. I do have an esata connector,
 though, and a suitable external cabinet for connecting one extra disk.

 How would I go about migrating/expanding the root pool to the
 larger disks so I can then use the larger disks for booting?
 I have no extra machine to use.

 Suppose we call the disks like so:

   A, B: your old 512-block drives
   X, Y: your new 2TB drives

 The easiest way would be to simply:

 1) zpool set autoexpand=on rpool
 2) offline the A drive
 3) physically replace it with the X drive
 4) do a zpool replace on it and wait for it to resilver
 
 When sector size differs, attaching it is going to fail (at least on fbsd).
 You might not get around a send-receive cycle...

Jim Klimov has already posted a way better guide, which rebuilds the
pool using the old one's data, so yeah, the replace route I recommended
here is rendered moot.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
Seems the problem is somewhat more egregious than I thought. The xcall
storm causes my network drivers to stop receiving IP multicast packets
and subsequently my recording applications record bad data, so
ultimately, this kind of isn't workable... I need to somehow resolve
this... I'm running four on-board Broadcom NICs in an LACP
aggregation. Any ideas on why this might be a side-effect? I'm really
kind of out of ideas here...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 03:57 PM, Sašo Kiselkov wrote:
 Seems the problem is somewhat more egregious than I thought. The xcall
 storm causes my network drivers to stop receiving IP multicast packets
 and subsequently my recording applications record bad data, so
 ultimately, this kind of isn't workable... I need to somehow resolve
 this... I'm running four on-board Broadcom NICs in an LACP
 aggregation. Any ideas on why this might be a side-effect? I'm really
 kind of out of ideas here...
 
 Cheers,
 --
 Saso

Just as another datapoint, though I'm not sure if it's going to be much
use: I found (via arcstat.pl) that the storms always start when ARC
downsizing starts. E.g. I would see the following in
./arcstat.pl 1:

Time       read  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
16:29:45     21     0    0     0    0     0    0   111G  111G
16:29:46      0     0    0     0    0     0    0   111G  111G
16:29:47      1     0    0     0    0     0    0   111G  111G
16:29:48      0     0    0     0    0     0    0   111G  111G
16:29:49     5K     0    0     0    0     0    0   111G  111G
  (this is where the problem starts)
16:29:50     36     0    0     0    0     0    0   109G  107G
16:29:51     51     0    0     0    0     0    0   107G  107G
16:29:52     10     0    0     0    0     0    0   107G  107G
16:29:53    148     0    0     0    0     0    0   107G  107G
16:29:54     5K     0    0     0    0     0    0   107G  107G
  (and after a while, around 10-15 seconds, it stops)

(I omitted the miss and miss% columns to make the rows fit).

During the time, the network stack is dropping input IP multicast UDP
packets like crazy, so I see my network input drop by about 30-40%.
Truly strange behavior...

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:21 PM, Matt Breitbach wrote:
 I saw this _exact_ problem after I bumped ram from 48GB to 192GB.  Low
 memory pressure seemed to be the culprit.  Happened usually during storage
 vmotions or something like that which effectively nullified the data in the
 ARC (sometimes 50GB of data would be purged from the ARC).  The system was
 so busy that it would drop 10Gbit LACP portchannels from our Nexus 5k stack.
 I never got a good solution to this other than to set arc_min_c to something
 that was close to what I wanted the system to use - I settled on setting it
 at ~160GB.  It still dropped the arcsz, but it didn't try to adjust arc_c
 and resulted in significantly fewer xcalls.

Hmm, how do I do that? I don't have that kind of symbol in the kernel.
I'm running OpenIndiana build 151a. My system indeed runs at low memory
pressure; I'm simply running a bunch of writers writing files linearly
with data they receive over IP/UDP multicast sockets.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:
 
 So the xcall are necessary part of memory reclaiming, when one needs to tear 
 down the TLB entry mapping the physical memory (which can from here on be 
 repurposed).
 So the xcall are just part of this. Should not cause trouble, but they do. 
 They consume a cpu for some time.
 
 That in turn can cause infrequent latency bubble on the network. A certain 
 root cause of these latency bubble is that network thread are bound by 
 default and
 if the xcall storm ends up on the CPU that the network thread is bound to, it 
 will wait for the storm to pass.

I understand, but the xcall storm only eats up a single core out
of a total of 32, plus it's not a single specific one, it tends to
change, so what are the odds of hitting the same core as the one on
which the mac thread is running?

 So try unbinding the mac threads; it may help you here.

How do I do that? All I can find on interrupt fencing and the like is to
simply set certain processors to no-intr, which moves all of the
interrupts away but doesn't prevent the xcall storm from choosing to
affect these CPUs either...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 06:06 PM, Jim Mauro wrote:
 

 So try unbinding the mac threads; it may help you here.

 How do I do that? All I can find on interrupt fencing and the like is to
 simply set certain processors to no-intr, which moves all of the
 interrupts and it doesn't prevent the xcall storm choosing to affect
 these CPUs either…
 
 In /etc/system:
 
 set mac:mac_soft_ring_thread_bind=0
 set mac:mac_srs_thread_bind=0
 
 Reboot required. Verify after reboot with mdb;
 
 echo mac_soft_ring_thread_bind/D | mdb -k

Trying that right now... thanks!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
 find where your nics are bound too
 
 mdb -k
 ::interrupts
 
 create a processor set including those cpus [ so just the nic code will
 run there ]
 
 andy

Tried and didn't help, unfortunately. I'm still seeing drops. What's
even funnier is that I'm seeing drops when the machine is sync'ing the
txg to the zpool. So looking at a little UDP receiver I can see the
following input stream bandwidth (the stream is constant bitrate, so
this shouldn't happen):

4.396151 Mbit/s   - drop
5.217205 Mbit/s
5.144323 Mbit/s
5.150227 Mbit/s
5.144150 Mbit/s
4.663824 Mbit/s   - drop
5.178603 Mbit/s
5.148681 Mbit/s
5.153835 Mbit/s
5.141116 Mbit/s
4.532479 Mbit/s   - drop
5.197381 Mbit/s
5.158436 Mbit/s
5.141881 Mbit/s
5.145433 Mbit/s
4.605852 Mbit/s   - drop
5.183006 Mbit/s
5.150526 Mbit/s
5.149324 Mbit/s
5.142306 Mbit/s
4.749443 Mbit/s   - drop

(txg timeout on my system is the default 5s)

It isn't just a slight delay in the arrival of the packets, because then
I should be seeing a rebound on the bitrate, sort of like this:

 ^ |-, ,^, ,^-, ,^
 B |  v   vv
   |
   +--
t -

Instead, what I'm seeing is simply:

 ^ |-, ,-, ,--, ,-
 B |  v   vv
   |
   +--
t -

(The missing spikes after the drops means that there were lost packets
on the NIC.)
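For reference, the little UDP receiver mentioned above is conceptually
just something like this (a rough sketch of mine, with an assumed multicast
group and port; not the actual tool):

  import socket, struct, time

  GROUP, PORT = "239.1.1.1", 5000          # assumed multicast group/port
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  sock.bind(("", PORT))
  mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
  sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

  # Print the received bitrate once per second; dips with no rebound
  # afterwards mean packets were lost on the NIC rather than merely delayed.
  window_start, received = time.time(), 0
  while True:
      data = sock.recv(65536)
      received += len(data)
      now = time.time()
      if now - window_start >= 1.0:
          print("%.6f Mbit/s" % (received * 8 / (now - window_start) / 1e6))
          window_start, received = now, 0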

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-12 Thread Sašo Kiselkov
On 06/12/2012 07:19 PM, Roch Bourbonnais wrote:
 
 Try with this /etc/system tunings :
 
 set mac:mac_soft_ring_thread_bind=0 set mac:mac_srs_thread_bind=0 
 set zfs:zio_taskq_batch_pct=50
 

Thanks for the recommendations, I'll try and see whether it helps, but
this is going to take me a while (especially since the reboot means
I'll have a cleared ARC and will need to build up around 120G of data
again, which takes a while to accumulate).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
occasionally seeing a storm of xcalls on one of the 32 VCPUs (10
xcalls a second). The machine is pretty much idle, only receiving a
bunch of multicast video streams and dumping them to the drives (at a
rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of
xcalls that completely eat one of the CPUs, so the mpstat line for the
CPU looks like:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 310   0 102191 1000000 00 100
 0   0

100% busy in the system processing cross-calls. When I tried dtracing
this issue, I found that this is the most likely culprit:

dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
   unix`xc_call+0x46
   unix`hat_tlb_inval+0x283
   unix`x86pte_inval+0xaa
   unix`hat_pte_unmap+0xed
   unix`hat_unload_callback+0x193
   unix`hat_unload+0x41
   unix`segkmem_free_vn+0x6f
   unix`segkmem_zio_free+0x27
   genunix`vmem_xfree+0x104
   genunix`vmem_free+0x29
   genunix`kmem_slab_destroy+0x87
   genunix`kmem_slab_free+0x2bb
   genunix`kmem_magazine_destroy+0x39a
   genunix`kmem_depot_ws_reap+0x66
   genunix`taskq_thread+0x285
   unix`thread_start+0x8
3221701

This happens in the sched (pid 0) process. My fsstat one looks like this:

# fsstat /content 1
 new  name  name  attr  attr  lookup  rddir  read  read   write  write
 file remov chng  get   set   ops     ops    ops   bytes  ops    bytes
    0     0    0   664     0     952      0     0      0    664  38.0M /content
    0     0    0   658     0     935      0     0      0    656  38.6M /content
    0     0    0   660     0     946      0     0      0    659  37.8M /content
    0     0    0   677     0     969      0     0      0    676  38.5M /content

What's even more puzzling is that this happens apparently entirely
because of some factor other than userland, since I see no changes to
CPU usage of processes in prstat(1M) when this xcall storm happens, only
an increase of loadavg of +1.00 (the busy CPU).

I Googled and found that
http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
seems to have been an issue identical to mine, however, it remains
unresolved at that time and it worries me about putting this kind of
machine into production use.

Could some ZFS guru please tell me what's going on in segkmem_zio_free?
When I disable the writers to the /content filesystem, this issue goes
away, so it has obviously something to do with disk IO. Thanks!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 04:55 PM, Richard Elling wrote:
 On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote:
 
 So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached
 to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two
 LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm
 occasionally seeing a storm of xcalls on one of the 32 VCPUs (10
 xcalls a second).
 
 That isn't much of a storm, I've seen > 1M xcalls in some cases...

Well it does make one of the cores 100% busy for around 10-15 seconds,
so it is processing at the maximum rate the core can do it. I'd call
that a sign of something bad(tm) going on.

 The machine is pretty much idle, only receiving a
 bunch of multicast video streams and dumping them to the drives (at a
 rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of
 xcalls that completely eat one of the CPUs, so the mpstat line for the
 CPU looks like:

 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 310   0 102191 1000000 00 100
 0   0

 100% busy in the system processing cross-calls. When I tried dtracing
 this issue, I found that this is the most likely culprit:

 dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
   unix`xc_call+0x46
   unix`hat_tlb_inval+0x283
   unix`x86pte_inval+0xaa
   unix`hat_pte_unmap+0xed
   unix`hat_unload_callback+0x193
   unix`hat_unload+0x41
   unix`segkmem_free_vn+0x6f
   unix`segkmem_zio_free+0x27
   genunix`vmem_xfree+0x104
   genunix`vmem_free+0x29
   genunix`kmem_slab_destroy+0x87
   genunix`kmem_slab_free+0x2bb
   genunix`kmem_magazine_destroy+0x39a
   genunix`kmem_depot_ws_reap+0x66
   genunix`taskq_thread+0x285
   unix`thread_start+0x8
 3221701

 This happens in the sched (pid 0) process. My fsstat one looks like this:

 # fsstat /content 1
  new  name  name  attr  attr  lookup  rddir  read  read   write  write
  file remov chng  get   set   ops     ops    ops   bytes  ops    bytes
     0     0    0   664     0     952      0     0      0    664  38.0M /content
     0     0    0   658     0     935      0     0      0    656  38.6M /content
     0     0    0   660     0     946      0     0      0    659  37.8M /content
     0     0    0   677     0     969      0     0      0    676  38.5M /content

 What's even more puzzling is that this happens apparently entirely
 because of some factor other than userland, since I see no changes to
 CPU usage of processes in prstat(1M) when this xcall storm happens, only
 an increase of loadavg of +1.00 (the busy CPU).
 
 What exactly is the workload doing?

As I wrote above, just receiving multicast video streams and writing
them to disk files, nothing else. The fsstat lines above show that -
pure write load.

 Local I/O, iSCSI, NFS, or CIFS?

Purely local I/O via the two LSI SAS controllers, nothing else.

 I Googled and found that
 http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html
 seems to have been an issue identical to mine, however, it remains
 unresolved at that time and it worries me about putting this kind of
 machine into production use.

 Could some ZFS guru please tell me what's going on in segkmem_zio_free?
 
 It is freeing memory.

Yes, but why is this causing a ton of cross-calls?

 When I disable the writers to the /content filesystem, this issue goes
 away, so it has obviously something to do with disk IO. Thanks!
 
 Not directly related to disk I/O bandwidth. Can be directly related to other
 use, such as deletions -- something that causes frees.

When I'm not writing to disk it doesn't happen, so my guess is that it
indeed has something to do with (perhaps) ZFS freeing txg buffers or
something...

 Depending on the cause, there can be some tuning that applies for large
 memory machines, where large is = 96 MB.
  -- richard

I'll try and load the machine with dd(1) to the max to see if access
patterns of my software have something to do with it.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 05:01 PM, Sašo Kiselkov wrote:
 I'll try and load the machine with dd(1) to the max to see if access
 patterns of my software have something to do with it.

Tried and tested, any and all write I/O to the pool causes this xcall
storm issue, writing more data to it only exacerbates it (i.e. it occurs
more often). I still get storms of over 100k xcalls completely draining
one CPU core, but now they happen in 20-30s intervals rather than every
1-2 minutes. Writing to the rpool, however, does not, so I suspect it
has something to do with the MPxIO and how ZFS is pumping data into the
twin LSI 9200 controllers. Each is attached to a different CPU I/O
bridge (since the system has two Opterons, it has two I/O bridges, each
handling roughly half of the PCI-e links). I did this in the hope of
improving performance (since the HT links to the I/O bridges will be
more evenly loaded). Any idea whether this might be the cause of this issue?

The whole system diagram is:

CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)--,
 |                                             \
(ht)                                            == JBOD
 |                                             /
CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)--'

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-06 Thread Sašo Kiselkov
On 06/06/2012 09:43 PM, Jim Mauro wrote:
 
 I can't help but be curious about something, which perhaps you verified but
 did not post.
 
 What the data here shows is;
 - CPU 31 is buried in the kernel (100% sys).
 - CPU 31 is handling a moderate-to-high rate of xcalls.
 
 What the data does not prove empirically is that the 100% sys time of
 CPU 31 is in xcall handling.
 
 What's the hot stack when this occurs and you run this;
 
 dtrace -n 'profile-997hz /cpu == 31/ { @[stack()] = count(); }'
 

Thanks for pointing this out. I ran the probe you specified and attached
are the results (I had to chase the xcalls around a bit, because they
were jumping around cores as I was trying to insert the probes). As I
suspected, the most numerous stack trace is the one which causes cross
calls because of the segkmem_zio_free+0x27 code path. While this was
going on, I was getting between 80k and 300k xcalls on the core in question.

The next most common stack was the one ending in mach_cpu_idle, so I'm
not sure why the CPU reported 100% busy (perhaps the xcalls were simply
very expensive in CPU time compared with the 1273 idles).

Cheers,
--
Saso


xc_call.txt.bz2
Description: application/bzip
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/25/2012 08:40 PM, Richard Elling wrote:
 See the soluion at https://www.illumos.org/issues/644
  -- richard

And predictably, I'm back with another n00b question regarding this
array. I've put a pair of LSI-9200-8e controllers in the server and
attached cables from the enclosure to each of the HBAs. As a result
(why?) I'm getting some really strange behavior:

 * piss poor performance (around 5MB/s per disk tops)
 * fmd(1M) running one core at near 100% saturation each time something
   writes or reads from the pool
 * using fmstat I noticed that it's the eft module receiving hundreds of
   fault reports every second
 * fmd is flooded by multipath failover ereports like:

...
May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered
May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr
May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran
May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr
May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered
...



I suspect that multipathing is not exactly happy with my Toshiba disks,
but I have no idea what to do to make it work at least somewhat
acceptably. I tried messing with scsi_vhci.conf to set
load-balance=none and change the scsi-vhci-failover-override for the
Toshiba disks to f_asym_lsi, flashing the latest as well as older
firmware on the cards, reseating them in other PCI-e slots, removing one
cable and even removing one whole HBA, unloading the eft fmd module,
etc., but nothing has helped so far and I'm sort of out of ideas.
Anybody else got an idea on what I might try?
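
(For reference, the scsi_vhci.conf hackery I mean above is along these
lines -- the Toshiba VID/PID string below is just a guess from memory,
not the exact one, so check yours with format or iostat -En before
copying anything:

    /kernel/drv/scsi_vhci.conf:

    load-balance="none";
    scsi-vhci-failover-override =
            "TOSHIBA MK2001TRKB", "f_asym_lsi";

plus a reboot. But as I said, none of it has made any difference so far.)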

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/30/2012 10:53 PM, Richard Elling wrote:
 On May 30, 2012, at 1:07 PM, Sašo Kiselkov wrote:
 
 On 05/25/2012 08:40 PM, Richard Elling wrote:
 See the soluion at https://www.illumos.org/issues/644
 -- richard

 And predictably, I'm back with another n00b question regarding this
 array. I've put a pair of LSI-9200-8e controllers in the server and
 attached the cables to the enclosure to each of the HBAs. As a result
 (why?) I'm getting some really strange behavior:

 * piss poor performance (around 5MB/s per disk tops)
 * fmd(1M) running one core at near 100% saturation each time something
   writes or reads from the pool
 * using fmstat I noticed that its the eft module receiving hundreds of
   fault reports every second
 * fmd is flooded by multipath failover ereports like:

 ...
 May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered
 May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr
 May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran
 May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr
 May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered
 ...



 I suspect that multipath is something not exactly very happy with my
 Toshiba disks, but I have no idea what to do to make it work at least
 somehow acceptably. I tried messing with scsi_vhci.conf to try and set
 load-balance=none, change the scsi-vhci-failover-override for the
 Toshiba disks to f_asym_lsi, flashing the latest as well as old firmware
 in the cards, reseating them to other PCI-e slots, removing one cable
 and even removing one whole HBA, unloading the eft fmd module etc, but
 nothing helped so far and I'm sort of out of ideas. Anybody else got an
 idea on what I might try?
 
 Those ereports are consistent with faulty cabling. You can trace all of the
 cables and errors using tools like lsiutil, sg_logs, kstats, etc. 
 Unfortunately,
 it is not really possible to get into this level of detail over email, and it 
 can
 consume many hours.
  -- richard

That's actually a pretty good piece of information for me! I will try
changing my cabling to see if I can get the errors to go away. Thanks
again for the suggestions!

Cheers
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-30 Thread Sašo Kiselkov
On 05/30/2012 10:53 PM, Richard Elling wrote:
 Those ereports are consistent with faulty cabling. You can trace all of the
 cables and errors using tools like lsiutil, sg_logs, kstats, etc. 
 Unfortunately,
 it is not really possible to get into this level of detail over email, and it 
 can
 consume many hours.
  -- richard

And it turns out you were right. Looking at errors using iostat -E while
manipulating the path taken by the data using mpathadm clearly shows
that one of the paths is faulty. Thanks again for pointing me in the
right direction!
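
(For the archives, the checking itself was nothing fancy -- roughly
this, with placeholders in angle brackets; see mpathadm(1M) for the
exact option syntax:

    # mpathadm list lu                 <- get the logical-unit names
    # mpathadm show lu <logical-unit>  <- shows both paths and their state
    # mpathadm disable path -i <initiator-port> -t <target-port> -l <logical-unit>
    # iostat -En                       <- note the transport error counters

then push some I/O through, re-check iostat -E, re-enable the path and
repeat with the other one disabled. The error counters only climbed
while one particular path was active, which is what gave the cabling
away.)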

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.
 
 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.
 
 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...

If the drives show up at all, chances are you only need to work around
the power-up issue in Dell HDD firmware.

Here's what I had to do to get the drives going in my R515:
/kernel/drv/sd.conf

sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                 "SEAGATE ST2000NM0001", "power-condition:false";

(that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
your drive model the strings might differ)

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = SEAGATE ST3300657SS, power-condition:false,
   SEAGATE ST2000NM0001, power-condition:false;

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
 your drive model the strings might differ)
 
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.

Simply take out the drive and have a look at the label.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 12:59 PM, Ian Collins wrote:
 On 05/28/12 10:53 PM, Sašo Kiselkov wrote:
 On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = SEAGATE ST3300657SS, power-condition:false,
SEAGATE ST2000NM0001, power-condition:false;

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on
 your drive model the strings might differ)
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.
 Simply take out the drive and have a look at the label.
 
 Tricky when the machine is on a different continent!
 
 Joking aside, *I* know what the drive is, the OS as far as I can tell
 doesn't.

Can you have a look at your /var/adm/messages or dmesg to check whether
the OS is complaining about "failed to power up" on the relevant drives?
If yes, then the above fix should work for you, all you need to do is
determine the exact manufacturer and model to enter into sd.conf and
reload the driver via update_drv -vf sd.
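
(To spell it out, the whole check-and-fix cycle is short -- the VID/PID
below is just my own R515 entry, substitute whatever your label and
messages file say; note the vendor field is padded to 8 characters:

    # grep "failed to power up" /var/adm/messages
    # vi /kernel/drv/sd.conf      <- append a line like the one below
    sd-config-list = "SEAGATE ST3300657SS", "power-condition:false";
    # update_drv -vf sd

and the drives should show up properly in format afterwards.)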

Cheers
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-28 Thread Sašo Kiselkov
On 05/28/2012 01:12 PM, Ian Collins wrote:
 On 05/28/12 11:01 PM, Sašo Kiselkov wrote:
 On 05/28/2012 12:59 PM, Ian Collins wrote:
 On 05/28/12 10:53 PM, Sašo Kiselkov wrote:
 On 05/28/2012 11:48 AM, Ian Collins wrote:
 On 05/28/12 08:55 PM, Sašo Kiselkov wrote:
 On 05/28/2012 10:48 AM, Ian Collins wrote:
 To follow up, the H310 appears to be useless in non-raid mode.

 The drives do show up in Solaris 11 format, but they show up as
 unknown, unformatted drives.  One oddity is the box has two SATA
 SSDs which also show up the card's BIOS, but present OK to
 Solaris.

 I'd like to re-FLASH the cards, but I don't think Dell would be
 too happy with me doing that on an evaluation system...
 If the drives show up at all, chances are you only need to work
 around
 the power-up issue in Dell HDD firmware.

 Here's what I had to do to get the drives going in my R515:
 /kernel/drv/sd.conf

 sd-config-list = SEAGATE ST3300657SS, power-condition:false,
 SEAGATE ST2000NM0001, power-condition:false;

 (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives,
 depending on
 your drive model the strings might differ)
 How would that work when the drive type is unknown (to format)?  I
 assumed if sd knows the type, so will format.
 Simply take out the drive and have a look at the label.
 Tricky when the machine is on a different continent!

 Joking aside, *I* know what the drive is, the OS as far as I can tell
 doesn't.
 Can you have a look at your /var/adm/messages or dmesg to check whether
 the OS is complaining about failed to power up on the relevant drives?
 If yes, then the above fix should work for you, all you need to do is
 determine the exact manufacturer and model to enter into sd.conf and
 reload the driver via update_drv -vf sd.
 
 Yes I do see that warning for the non-raid drives.
 
 The problem is I'm booting from a remote ISO image, so I can't alter
 /kernel/drv/sd.conf.
 
 I'll play more tomorrow, typing on a remote console inside an RDP
 session running in a VNC session on a virtual machine is interesting :)

I'm not sure about the Solaris 11 installer, but OpenIndiana's installer
runs from a ramdisk, so theoretically that should be doable. Failing
that, you could copy the contents of /kernel from the ISO into a
ramdisk, mount that in place of /kernel and then issue the reload
command. In any case, you seem to be having exactly the same issue as I
did, so all you need to do is the above magic.
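
(Concretely, I mean something along these lines from the installer
shell -- untested on the Solaris 11 installer from my side, so treat it
as a sketch:

    # mkdir /tmp/kernel
    # cp -rp /kernel/* /tmp/kernel/
    # vi /tmp/kernel/drv/sd.conf      <- add the power-condition:false entry
    # mount -F lofs -O /tmp/kernel /kernel
    # update_drv -vf sd

/tmp is tmpfs-backed, so that effectively gives you a writable copy of
/kernel for the duration of the install.)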

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?

2012-05-27 Thread Sašo Kiselkov
On 05/07/2012 05:42 AM, Greg Mason wrote:
 I am currently trying to get two of these things running Illumian. I don't 
 have any particular performance requirements, so I'm thinking of using some 
 sort of supported hypervisor, (either RHEL and KVM or VMware ESXi) to get 
 around the driver support issues, and passing the disks through to an 
 Illumian guest.
 
 The H310 does indeed support pass-through (the non-raid mode), but one thing 
 to keep in mind is that I was only able to configure a single boot disk. I 
 configured the rear two drives into a hardware raid 1 and set the virtual 
 disk as the boot disk so that I can still boot the system if an OS disk fails.
 
 Once Illumos is better supported on the R720 and the PERC H310, I plan to get 
 rid of the hypervisor silliness and run Illumos on bare metal.

How about reflashing LSI firmware to the card? I read on Dell's spec
sheets that the card runs an LSISAS2008 chip, so chances are that
standard LSI firmware will work on it. I can send you all the required
bits to do the reflash, if you like.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
I'm currently trying to get a SuperMicro JBOD with dual SAS expander
chips running with MPxIO, but I'm a total amateur at this and would like
to ask how to detect whether MPxIO is working (or not).

My SAS topology is:

 *) One LSI SAS2008-equipped HBA (running the latest IT firmware from
LSI) with two external ports.
 *) Two SAS cables running from the HBA to the SuperMicro JBOD, where
they enter the JBOD's rear backplane (which is equipped with two
LSI SAS expander chips).
 *) From the rear backplane, via two internal SAS cables to the front
backplane (also with two SAS expanders on it)
 *) The JBOD is populated with 45 2TB Toshiba SAS 7200rpm drives

The machine also has a PERC H700 for the boot media, configured into a
hardware RAID-1 (on which rpool resides).

Here is the relevant part from cfgadm -al for the MPxIO bits:

c5                            scsi-sas     connected    configured   unknown
c5::dsk/c5t5393D8CB4452d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8C90CF2d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2A6d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2AAd0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2BEd0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2C6d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2E2d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF2F2d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF5C6d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF28Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF32Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF35Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF35Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF36Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF36Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF52Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF53Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF53Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF312d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF316d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF506d0     disk         connected    configured   unknown
c5::dsk/c5t5393E8CAF546d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C84F5Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C84FBAd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C851EEd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C852A6d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C852C2d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C852CAd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C852EAd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C854BAd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C854E2d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C855AAd0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8509Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8520Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8528Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8530Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8531Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8557Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8558Ed0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C8560Ad0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C85106d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C85222d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C85246d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C85366d0     disk         connected    configured   unknown
c5::dsk/c5t5393F8C85636d0     disk         connected    configured   unknown
c5::es/ses0                   ESI          connected    configured   unknown
c5::es/ses1                   ESI          connected    configured   unknown
c5::smp/expd0                 smp          connected    configured   unknown
c5::smp/expd1                 smp          connected    configured   unknown
c6                            scsi-sas     connected    configured   unknown
c6::dsk/c6t5393D8CB4453d0     disk         connected    configured   unknown
c6::dsk/c6t5393E8C90CF3d0     disk         connected    configured   unknown
c6::dsk/c6t5393E8CAF2A7d0     disk         connected    configured   unknown
c6::dsk/c6t5393E8CAF2ABd0     disk 

Re: [zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
On 05/25/2012 07:35 PM, Jim Klimov wrote:
 Sorry I can't comment on MPxIO, except that I thought zfs could by
 itself discern two paths to the same drive, if only to protect
 against double-importing the disk into pool.

Unfortunately, it isn't the same thing. MPxIO provides redundant
signaling to the drives, independent of the storage/RAID layer above
it, so it does have its place (besides simply increasing throughput).
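
(And, for the archives, the answer to my original "how do I tell whether
MPxIO is working" question seems to boil down to two quick checks:

    # mpathadm list lu     <- each drive should show up once, with Total Path Count: 2
    # stmsboot -L          <- lists the non-STMS to STMS device name mapping

i.e. with MPxIO doing its job I should see a single scsi_vhci device per
drive instead of one device per controller as in the cfgadm output
above.)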

 I am not sure it is a good idea to use such low protection (raidz1)
 with large drives. At least, I was led to believe that after 2Tb in
 size raidz2 is preferable, and raidz3 is optimal due to long
 scrub/resilver times leading to large timeframes that a pool with
 an error is exposed to possible fatal errors (due to
 double-failures with single-protection).

I'd use lower protection if it were available :) The data on that
array is not very important, the primary design parameter is low cost
per MB. We're in a very demanding IO environment, we need large
quantities of high-throughput, high-IOPS storage, but we don't need
stellar reliability. If the pool gets corrupted due to unfortunate
double-drive failure, well, that's tough, but not unbearable (the pool
stores customer channel recordings for nPVR, so nothing critical really).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] MPxIO n00b question

2012-05-25 Thread Sašo Kiselkov
On 05/25/2012 08:40 PM, Richard Elling wrote:
 See the soluion at https://www.illumos.org/issues/644 -- richard

Good Lord, that was it! It never occurred to me that the drives had a
say in this. Thanks a billion!

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
Hi,

I'm getting weird errors while trying to install openindiana 151a on a
Dell R715 with a PERC H200 (based on an LSI SAS 2008). Any time the OS
tries to access the drives (for whatever reason), I get this dumped into
syslog:

genunix: WARNING: Device
/pci@0,0/pci1002,5a18@4/pci10b58424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@40/disk@w5c0f01004ebe,0
failed to power up
genunix: WARNING: Device
/pci@0,0/pci1002,5a18@4/pci10b58424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@80/disk@w5c0f01064e9e,0
failed to power up

(these are two WD 300GB 10k SAS drives)

When this log message shows up, I can see each drive light up the drive
LED briefly and then it turns off, so apparently the OS tried to
initialize the drives, but somehow failed and gave up.

Consequently, when I try and access them in format(1), they show up as
an unknown type and installing openindiana on them fails while the
installer is trying to do fdisk.

Has anybody got any idea what I can do to the controller/drives/whatever
to fix the "failed to power up" problem? One would think that an LSI SAS
2008 chip would be problem-free under Solaris (the server even lists
Oracle Solaris as an officially supported OS), but alas, I have yet to
succeed.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote:
 Hi,
 
 are those DELL branded WD disks? DELL tends to manipulate the
 firmware of the drives so that power handling with Solaris fails.
 If this is the case here:
 
 Easiest way to make it work is to modify /kernel/drv/sd.conf and
 add an entry for your specific drive similar to this
 
 sd-config-list= WD  WD2000FYYG,power-condition:false, 
 SEAGATE ST2000NM0001,power-condition:false, SEAGATE
 ST32000644NS,power-condition:false, SEAGATE
 ST91000640SS,power-condition:false;
 
 Naturally you would have to find out the correct drive names. My
 latest version for a R710 with a MD1200 attached is:
 
 sd-config-list=SEAGATE ST2000NM0001,power-condition:false, 
 SEAGATE ST1000NM0001,power-condition:false, SEAGATE
 ST91000640SS,power-condition:false;
 
 
 Are you using the H200 with the base firmware or did you flash it
 to LSI IT? I am not sure that Solaris handles the H200 natively at
 all and if then it will not have direct drive access since the H200
 will only show virtual drives to Solaris/OI will it not?

They are Dell-branded WD disks and I haven't done anything to the
HBA's firmware, so that's stock Dell as well. The drives, specifically,
are WD3001BKHG models. The firmware actually does expose the disks
unless they're part of a RAID group, so that should work. I'll try the
power-condition workaround you mentioned.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote:
 Hi,
 
 are those DELL branded WD disks? DELL tends to manipulate the firmware of
 the drives so that power handling with Solaris fails. If this is the case
 here:
 
 Easiest way to make it work is to modify /kernel/drv/sd.conf and add an
 entry
 for your specific drive similar to this
 
 sd-config-list= WD  WD2000FYYG,power-condition:false,
 SEAGATE ST2000NM0001,power-condition:false,
 SEAGATE ST32000644NS,power-condition:false,
 SEAGATE ST91000640SS,power-condition:false;
 
 Naturally you would have to find out the correct drive names. My latest
 version for a R710 with a MD1200 attached is:
 
 sd-config-list=SEAGATE ST2000NM0001,power-condition:false,
 SEAGATE ST1000NM0001,power-condition:false,
 SEAGATE ST91000640SS,power-condition:false;
 
 
 Are you using the H200 with the base firmware or did you flash it to LSI IT?
 I am not sure that Solaris handles the H200 natively at all and if then it
 will not have direct drive access since the H200 will only show virtual
 drives to Solaris/OI will it not?
 
 Kind regards,
JP
 
 PS: These are not my findings. Cudos to Sergei (tehc...@gmail.com) and
 Niklas Tungström.

One thing came up while trying this - I'm on a text install image
system, so my / is a ramdisk. Any ideas how I can change the sd.conf on
the USB disk or reload the driver configuration on the fly? I tried
looking for the file on the USB drive, but it isn't in the rootfs
(perhaps it's tucked away in some compressed filesystem image). Thanks!

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dell PERC H200: drive failed to power up

2012-05-16 Thread Sašo Kiselkov
On 05/16/2012 10:17 AM, Koopmann, Jan-Peter wrote:
 
 
 One thing came up while trying this - I'm on a text install
 image system, so my / is a ramdisk. Any ideas how I can change
 the sd.conf on the USB disk or reload the driver configuration on
 the fly? I tried looking for the file on the USB drive, but it
 isn't in the rootfs (perhaps it's tucked away in some compressed
 filesystem image). Thanks!
 
 I am by no means a Solaris or OI guru and live from good advice of
 other people and Mr. Google. So sorry. I have no clue…

I got lucky at Googling after all and found the relevant command:

# update_drv -vf sd

The PERC H200 card had nothing to do with it, it was all in the crappy
firmware in the HDDs. Simply adding

sd-config-list = "WD      WD3001BKHG", "power-condition:false";

to my /kernel/drv/sd.conf (as you suggested) and reloading the driver
using update_drv solved it and I could then proceed with the
installation. The installer was even smart enough to install the
customized sd.conf into the new system, so no further tuning was
necessary.

Thanks for the pointers, you saved my bacon.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Windows 8 ReFS (OT)

2012-01-17 Thread Sašo Kiselkov
On 01/17/2012 01:06 AM, David Magda wrote:
 Kind of off topic, but I figured of some interest to the list. There will be 
 a new file system in Windows 8 with some features that we all know and love 
 in ZFS:
 
 As mentioned previously, one of our design goals was to detect and correct 
 corruption. This not only ensures data integrity, but also improves system 
 availability and online operation. Thus, all ReFS metadata is check-summed 
 at the level of a B+ tree page, and the checksum is stored independently 
 from the page itself. [...] Once ReFS detects such a failure, it interfaces 
 with Storage Spaces to read all available copies of data and chooses the 
 correct one based on checksum validation. It then tells Storage Spaces to 
 fix the bad copies based on the good copies. All of this happens 
 transparently from the point of view of the application.

Looks like what the Btrfs people were trying to do.

--
S
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2012-01-06 Thread Sašo Kiselkov
On 07/01/2011 12:01 AM, Sašo Kiselkov wrote:
 On 06/30/2011 11:56 PM, Sašo Kiselkov wrote:
 Hm, it appears I'll have to do some reboots and more extensive testing.
 I tried tuning various settings and then returned everything back to the
 defaults. Yet, now I can ramp the number of concurrent output streams to
 ~170 instead of the original 95 (even then the pool still has capacity
 left, I'm actually running out of CPU power). The txg commit occurs at
 roughly every 15 (or so) seconds, which is what I wanted. Strange that
 this occurs even after I returned everything to the defaults... I'll try
 doing some more testing on this once I move the production deployment to
 a different system and I'll have more time to experiment with this
 machine. Anyways, thanks for the suggestions, it helped a lot.

 Regards,
 --
 Saso
 
 Just a follow correction: one parameter was indeed changed:
 zfs_write_limit_inflated. In the source it's set to zero, I've set it to
 0x2.

So it seems I was wrong after all and it didn't help. So the question
remains: is there a way to force ZFS *NOT* to commit a txg before a
certain minimum amount of data has accumulated in it, or before the txg
timeout is reached?
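
(The closest knob I'm aware of is the txg timer itself -- e.g. on this
box something like

    # echo zfs_txg_timeout/W0t10 | mdb -kw

or persistently:

    /etc/system:  set zfs:zfs_txg_timeout = 10

but as far as I can tell that only stretches the upper bound between
commits; there doesn't seem to be any "don't commit before N bytes"
control, which is exactly what I'm asking about.)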

All the best,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HP JBOD D2700 - ok?

2011-11-30 Thread Sašo Kiselkov
On 11/30/2011 02:40 PM, Edmund White wrote:
 Absolutely. 
 
 I'm using a fully-populated D2700 with an HP ProLiant DL380 G7 server
 running NexentaStor.
 
 On the HBA side, I used the LSI 9211-8i 6G controllers for the server's
 internal disks (boot, a handful of large disks, Pliant SSDs for L2Arc).
 There is also a DDRDrive for ZIL. To connect to the D2700 enclosure, I
 used 2 x LSI 9205 6G HBAs; one 4-lane SAS cable per storage controller on
 the D2700.
 
 These were setup with MPxIO (dual controllers, dual paths, dual-ported
 disks) and required a slight bit of tuning of /kernel/drv/scsi_vhci.conf,
 but the performance is great now. The enclosure is supported and I've been
 able to setup drive slot maps and control disk LED's, etc.
 

Coincidentally, I'm also thinking about getting a few D2600 enclosures,
but I've been considering attaching them via a pair of HP SC08Ge 6G SAS
HBAs. Has anybody had any experience with these HBAs? According to a few
searches on the Internet, it should be a rebranded LSI9200-8e.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 01:10 PM, Jim Klimov wrote:
 2011-06-30 11:47, Sašo Kiselkov wrote:
 On 06/30/2011 02:49 AM, Jim Klimov wrote:
 2011-06-30 2:21, Sašo Kiselkov wrote:
 On 06/29/2011 02:33 PM, Sašo Kiselkov wrote:
 Also there is a buffer-size limit, like this (384Mb):
 set zfs:zfs_write_limit_override = 0x1800

 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw
 Currently my value for this is 0. How should I set it? I'm writing
 ~15MB/s and would like txg flushes to occur at most once every 10
 seconds. Should I set it to 150MB then?

 We had similar spikes with big writes to a Thumper with SXCE in the
 pre-90's
 builds, when the system would stall for seconds while flushing a
 30-second TXG
 full of data. Adding a reasonable megabyte limit solved the
 unresponsiveness
 problem for us, by making these flush-writes rather small and quick.
 I need to do the opposite - I don't need to shorten the interval of
 writes, I need to increase it. Can I do that using
 zfs_write_limit_override?
 Just as a folloup, I've had a look at the tunables in dsl_pool.c and
 found that I could potentially influence the write pressure calculation
 by tuning zfs_txg_synctime_ms - do you think increasing this value from
 its default (1000ms) help me lower the write scheduling frequency? (I
 don't mind if a txg write takes even twice as long, my application
 buffers are on average 6 seconds long.)

 Regards,
 -- 
 Saso
 It might help. In my limited testing on oi_148a,
 it seems that zfs_txg_synctime_ms and zfs_txg_timeout
 are linked somehow (i.e. changing one value changed the
 other accordingly). So in effect they may be two names
 for the same tunable (one in single units of secs, another
 in thousands of msecs).
 Well, to my understanding, zfs_txg_timeout is the timer limit on
 flushing pending txgs to disk - if the timer fires the current txg is
 written to disk regardless of its size. Otherwise the txg scheduling
 algorithm should take into account I/O pressure on the pool, estimate
 the remaining write bandwidth and fire when it estimates that a txg
 commit would overflow zfs_txg_synctime[_ms]. I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more knowledge
 of these internals please chime in?
 
 
 Somewhere in our discussion the Reply-to-all was lost.
 Back to the list :)
 
 Saso: Did you try setting both the timeout limit and the
 megabyte limit values, and did you see system IO patterns
 correlate with these values?
 
 My understanding was lke yours above, so if things are
 different in reality - I'm interested to know too.
 
 PS: I don't think you wrote: which OS version do you use?

Thanks for the suggestions, I'll try them out. I'm running oi_148.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 01:33 PM, Jim Klimov wrote:
 2011-06-30 15:22, Sašo Kiselkov wrote:
 I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more
 knowledge
 of these internals please chime in?

 
 And about this part - it was my understanding and experience
 (from SXCE) that these values can be set at run-time and are
 used as soon as set (or maybe in a few TXGs - but visibly in
 real-time).
 
 Also I've seen instant result from setting the TXG sync times
 on oi_148a with little loads (in my thread about trying to
 account for some 2Mb writes to my root pool) - this could be
 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on TXG
 timeout currently set value.
 

Hm, it appears I'll have to do some reboots and more extensive testing.
I tried tuning various settings and then returned everything back to the
defaults. Yet, now I can ramp the number of concurrent output streams to
~170 instead of the original 95 (even then the pool still has capacity
left, I'm actually running out of CPU power). The txg commit occurs at
roughly every 15 (or so) seconds, which is what I wanted. Strange that
this occurs even after I returned everything to the defaults... I'll try
doing some more testing on this once I move the production deployment to
a different system and I'll have more time to experiment with this
machine. Anyways, thanks for the suggestions, it helped a lot.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-30 Thread Sašo Kiselkov
On 06/30/2011 11:56 PM, Sašo Kiselkov wrote:
 On 06/30/2011 01:33 PM, Jim Klimov wrote:
 2011-06-30 15:22, Sašo Kiselkov wrote:
 I tried increasing this
 value to 2000 or 3000, but without an effect - perhaps I need to set it
 at pool mount time or in /etc/system. Could somebody with more
 knowledge
 of these internals please chime in?


 And about this part - it was my understanding and experience
 (from SXCE) that these values can be set at run-time and are
 used as soon as set (or maybe in a few TXGs - but visibly in
 real-time).

 Also I've seen instant result from setting the TXG sync times
 on oi_148a with little loads (in my thread about trying to
 account for some 2Mb writes to my root pool) - this could be
 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on TXG
 timeout currently set value.

 
 Hm, it appears I'll have to do some reboots and more extensive testing.
 I tried tuning various settings and then returned everything back to the
 defaults. Yet, now I can ramp the number of concurrent output streams to
 ~170 instead of the original 95 (even then the pool still has capacity
 left, I'm actually running out of CPU power). The txg commit occurs at
 roughly every 15 (or so) seconds, which is what I wanted. Strange that
 this occurs even after I returned everything to the defaults... I'll try
 doing some more testing on this once I move the production deployment to
 a different system and I'll have more time to experiment with this
 machine. Anyways, thanks for the suggestions, it helped a lot.
 
 Regards,
 --
 Saso

Just a follow-up correction: one parameter was indeed changed:
zfs_write_limit_inflated. In the source it's set to zero, I've set it to
0x2.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-29 Thread Sašo Kiselkov
On 06/29/2011 02:33 PM, Sašo Kiselkov wrote:
 Also there is a buffer-size limit, like this (384Mb):
 set zfs:zfs_write_limit_override = 0x1800

 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw
 
 Currently my value for this is 0. How should I set it? I'm writing
 ~15MB/s and would like txg flushes to occur at most once every 10
 seconds. Should I set it to 150MB then?
 
 We had similar spikes with big writes to a Thumper with SXCE in the pre-90's
 builds, when the system would stall for seconds while flushing a 30-second 
 TXG
 full of data. Adding a reasonable megabyte limit solved the unresponsiveness
 problem for us, by making these flush-writes rather small and quick.
 
 I need to do the opposite - I don't need to shorten the interval of
 writes, I need to increase it. Can I do that using zfs_write_limit_override?

Just as a followup, I've had a look at the tunables in dsl_pool.c and
found that I could potentially influence the write pressure calculation
by tuning zfs_txg_synctime_ms - do you think increasing this value from
its default (1000 ms) would help me lower the write scheduling
frequency? (I don't mind if a txg write takes even twice as long, my
application buffers are on average 6 seconds long.)
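
(In case anyone wants to poke at the same knob: the symbol name seems to
differ between builds -- my box has zfs_txg_synctime_ms in milliseconds,
while older bits apparently expose zfs_txg_synctime in seconds -- so I
first check what's there and then poke it, along the lines of

    # echo zfs_txg_synctime_ms/D | mdb -k           <- read the current value
    # echo zfs_txg_synctime_ms/W0t2000 | mdb -kw    <- bump it to 2000 ms

no reboot needed, though whether this actually changes the commit
scheduling is exactly what I'm trying to find out.)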

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-29 Thread Sašo Kiselkov
On 06/27/2011 11:59 AM, Jim Klimov wrote:
 
 I'd like to ask about whether there is a method to enforce a 
 certain txg
 commit frequency on ZFS. 
  
 Well, there is a timer frequency based on TXG age (i.e 5 sec 
 by default now), in /etc/system like this:
  
 set zfs:zfs_txg_synctime = 5

When trying to read the value through mdb I get:

# echo zfs_txg_synctime::print | mdb -k
mdb: failed to dereference symbol: unknown symbol name

Is this some new addition in S11E?

 Also there is a buffer-size limit, like this (384Mb):
 set zfs:zfs_write_limit_override = 0x1800
 
 or on command-line like this:
 # echo zfs_write_limit_override/W0t402653184 | mdb -kw

Currently my value for this is 0. How should I set it? I'm writing
~15MB/s and would like txg flushes to occur at most once every 10
seconds. Should I set it to 150MB then?

 We had similar spikes with big writes to a Thumper with SXCE in the pre-90's
 builds, when the system would stall for seconds while flushing a 30-second TXG
 full of data. Adding a reasonable megabyte limit solved the unresponsiveness
 problem for us, by making these flush-writes rather small and quick.

I need to do the opposite - I don't need to shorten the interval of
writes, I need to increase it. Can I do that using zfs_write_limit_override?
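
(For completeness, if I do end up trying the override: 15 MB/s over a
10-second txg works out to 150 MB, i.e. 157286400 bytes, so in the same
form as your example that would be

    # echo zfs_write_limit_override/W0t157286400 | mdb -kw

though if I read your Thumper story right, this caps the txg size and
therefore makes commits *more* frequent under load, which is the
opposite of what I'm after.)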

Thanks.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fixing txg commit frequency

2011-06-26 Thread Sašo Kiselkov
On 06/26/2011 06:17 PM, Richard Elling wrote:
 
 On Jun 24, 2011, at 5:29 AM, Sašo Kiselkov wrote:
 
 Hi All,

 I'd like to ask about whether there is a method to enforce a certain txg
 commit frequency on ZFS. I'm doing a large amount of video streaming
 from a storage pool while also slowly continuously writing a constant
 volume of data to it (using a normal file descriptor, *not* in O_SYNC).
 When reading volume goes over a certain threshold (and average pool load
 over ~50%), ZFS thinks it's running out of steam on the storage pool and
 starts committing transactions more often which results in even greater
 load on the pool. This leads to a sudden spike in I/O utilization on the
 pool in roughly the following method:

 # streaming clients  pool load [%]
  15  8%
  20 11%
  40 22%
  60 33%
  80 44%
 --- around here txg timeouts start to shorten ---
  85 60%
90   70%
  95 85%
 
 What is a pool load? We expect 100% utilization during the txg commit,
 anything else is a waste.
 
 I suspect that you actually want more, smaller commits to spread the load
 more evenly. This is easy to change, but unless you can tell us what OS
 you are running, including version, we don't have a foundation to build upon.
  -- richard

Pool load is a 60-second average of the aggregated util percentages as
reported by iostat -D for the disks which comprise the pool (so I run
iostat -Dn {pool-disks} 60 and compute the load for each row printed
as an average of the util columns). Interestingly enough, when
watching 1-second updates in iostat I never see util hit 100% during a
txg commit, even if it takes two or more seconds to complete. This tells
me that the disks still have enough performance headroom so that zfs
doesn't really need to shorten the interval at which commits occur.
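
(If it's of any use, the number crunching itself is just a throwaway awk
over the iostat output -- a rough sketch, assuming the default -D layout
of one r/s, w/s, util triplet per disk and crudely skipping the header
lines:

    iostat -Dn {pool-disks} 60 | awk '
        $1 ~ /^[0-9.]+$/ {
                s = 0; n = 0
                for (i = 3; i <= NF; i += 3) { s += $i; n++ }
                if (n) printf "%.1f%%\n", s / n
        }'

so "pool load" here is simply the mean of the util columns per 60-second
sample.)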

I'm running oi_148, and all pools are zfs version 28.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Fixing txg commit frequency

2011-06-24 Thread Sašo Kiselkov
Hi All,

I'd like to ask whether there is a method to enforce a certain txg
commit frequency on ZFS. I'm doing a large amount of video streaming
from a storage pool while also slowly, continuously writing a constant
volume of data to it (using a normal file descriptor, *not* in O_SYNC).
When the read volume goes over a certain threshold (and average pool
load over ~50%), ZFS thinks it's running out of steam on the storage
pool and starts committing transactions more often, which results in
even greater load on the pool. This leads to a sudden spike in I/O
utilization on the pool, roughly as follows:

 # streaming clients    pool load [%]
          15                  8%
          20                 11%
          40                 22%
          60                 33%
          80                 44%
 --- around here txg timeouts start to shorten ---
          85                 60%
          90                 70%
          95                 85%

My application does a fair bit of caching and prefetching, so I have
zfetch disabled and primarycache set to only metadata. Also, reads
happen (on a per client basis) relatively infrequently, so I can easily
take it if the pool stops reading for a few seconds and just writes
data. The problem is, ZFS starts alternating between reads and writes
really quickly, which in turn starves me on IOPS and results in a huge
load spike. Judging by load numbers up to around 80 concurrent clients,
I suspect I could go up to 150 concurrent clients on this pool, but
because of this spike I top out at around 95-100 concurrent clients.
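
(For context, the "zfetch disabled and primarycache set to only
metadata" bit above amounts to nothing more exotic than

    # zfs set primarycache=metadata <dataset>      <- dataset name is a placeholder
    /etc/system:  set zfs:zfs_prefetch_disable = 1

since the application does its own read-ahead and caching anyway.)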

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-24 Thread Sašo Kiselkov
On 05/24/2011 03:08 PM, a.sm...@ukgrid.net wrote:
 Hi,
 
   see the seeksize script on this URL:
 
 http://prefetch.net/articles/solaris.dtracetopten.html
 
 Not used it but looks neat!
 
 cheers Andy.

I already did and it does the job just fine. Thank you for your kind
suggestion.

BR,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-20 Thread Sašo Kiselkov
On 05/19/2011 07:47 PM, Richard Elling wrote:
 On May 19, 2011, at 5:35 AM, Sašo Kiselkov wrote:
 
 Hi all,

 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
 
 In general, this is hard to see from the OS.  In Solaris, the default I/O
 flowing through sd gets sorted based on LBA before being sent to the
 disk. If the disks gets more than 1 concurrent I/O request (10 is the default
 for Solaris-based ZFS) then the disk can resort or otherwise try to optimize
 the media accesses.
 
 As others have mentioned, iopattern is useful for looking at sequential 
 patterns. I've made some adjustments for the version at
 http://www.richardelling.com/Home/scripts-and-programs-1/iopattern
 
 You can see low-level SCSI activity using scsi.d, but I usually uplevel that
 to using iosnoop -Dast which shows each I/O and its response time.
 Note that the I/Os can complete out-of-order on many devices. The only 
 device I know that is so fast and elegant that it always completes in-order 
 is the DDRdrive.
 
 For detailed analysis of iosnoop data, you will appreciate a real statistics
 package. I use JMP, but others have good luck with R.
  -- richard

Thank you, the iopattern script seems to be quite close to what I
wanted. The percentage split between random and sequential I/O is pretty
much what I needed to know.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Monitoring disk seeks

2011-05-19 Thread Sašo Kiselkov
Hi all,

I'd like to ask whether there is a way to monitor disk seeks. I have an
application where many concurrent readers (50) sequentially read a
large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
read/write ops using iostat, but that doesn't tell me how contiguous the
data is, i.e. when iostat reports 500 read ops, does that translate to
500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
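
(What I've been sketching myself, in case nothing ready-made exists, is
a small DTrace script that histograms the distance between consecutive
I/Os per device -- along these lines, with the caveat that the first
sample per device is garbage and I haven't validated it properly:

    #!/usr/sbin/dtrace -s
    /* per-device histogram of block distance between consecutive I/Os */
    io:::start
    {
            this->dev = args[1]->dev_statname;
            @dist[this->dev] = quantize(args[0]->b_blkno - last[this->dev]);
            last[this->dev] = args[0]->b_blkno + args[0]->b_bcount / 512;
    }

but if there's an existing tool that already answers the seeks-vs-reads
question I'd much rather use that.)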

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring disk seeks

2011-05-19 Thread Sašo Kiselkov
On 05/19/2011 03:35 PM, Tomas Ögren wrote:
 On 19 May, 2011 - Sašo Kiselkov sent me these 0,6K bytes:
 
 Hi all,

 I'd like to ask whether there is a way to monitor disk seeks. I have an
 application where many concurrent readers (50) sequentially read a
 large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor
 read/write ops using iostat, but that doesn't tell me how contiguous the
 data is, i.e. when iostat reports 500 read ops, does that translate to
 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!
 
 Get DTraceToolkit and check out the various things under Disk and FS,
 might help.
 
 /Tomas

Thank you all for the tips, I'll try to poke around using the DTraceToolkit.

Regards,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-09 Thread Sašo Kiselkov
On 04/09/2011 01:41 PM, Edward Ned Harvey wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Julian King

 Actually I think our figures more or less agree. 12 disks = 7 mbits
 48 disks = 4x7mbits
 
 I know that sounds like terrible performance to me.  Any time I benchmark
 disks, a cheap generic SATA can easily sustain 500Mbit, and any decent drive
 can easily sustain 1Gbit.

I think he mistyped and meant 7gbit/s.

 Of course it's lower when there's significant random seeking happening...
 But if you have a data model which is able to stream sequentially, the above
 is certainly true.

Unfortunately, this is exactly my scenario, where I want to stream large
volumes of data in many concurrent threads over large datasets which
have no hope of fitting in RAM or L2ARC and with generally very little
locality.

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 05:20 PM, Mark Sandrock wrote:
 
 On Apr 8, 2011, at 7:50 AM, Evaldas Auryla evaldas.aur...@edqm.eu wrote:
 
 On 04/ 8/11 01:14 PM, Ian Collins wrote:
 You have built-in storage failover with an AR cluster;
 and they do NFS, CIFS, iSCSI, HTTP and WebDav
 out of the box.

 And you have fairly unlimited options for application servers,
 once they are decoupled from the storage servers.

 It doesn't seem like much of a drawback -- although it
 may be for some smaller sites. I see AR clusters going in
 in local high schools and small universities.

 Which is all fine and dandy if you have a green field, or are willing to
 re-architect your systems.  We just wanted to add a couple more x4540s!


 Hi, same here, it's a sad news that Oracle decided to stop x4540s production 
 line. Before, ZFS geeks had choice - buy 7000 series if you want quick out 
 of the box storage with nice GUI, or build your own storage with x4540 
 line, which by the way has brilliant engineering design, the choice is gone 
 now.
 
 Okay, so what is the great advantage
 of an X4540 versus X86 server plus
 disk array(s)?
 
 Mark

Several:

 1) Density: The X4540 has far greater density than 1U server + Sun's
J4200 or J4400 storage arrays. The X4540 did 12 disks / 1RU, whereas a
1U + 2xJ4400 only manages ~5.3 disks / 1RU.

 2) Number of components involved: server + disk enclosure means you
have more PSUs which can die on you, more cabling to accidentally
disconnect and generally more hassle with installation.

 3) Spare management: With the X4540 you only have to have one kind of
spare component: the server. With servers + enclosures, you might need
to keep multiple.

I agree that besides 1), both 2) and 3) are relatively trivial problems
to solve. Of course, server + enclosure builds do have their place, such
as when you might need to scale, but even then you could just hook them
up to an X4540 (or purchase a new one - I never quite understood why the
storage-enclosure-only variant of the X4540 case was more expensive than
an identical server).

In short, I think the X4540 was an elegant and powerful system that
definitely had its market, especially in my area of work (digital video
processing - heavy on latency, throughput and IOPS - an area, where the
7000-series with its over-the-network access would just be a totally
useless brick).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 no next-gen product?

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 06:59 PM, Darren J Moffat wrote:
 On 08/04/2011 17:47, Sašo Kiselkov wrote:
 In short, I think the X4540 was an elegant and powerful system that
 definitely had its market, especially in my area of work (digital video
 processing - heavy on latency, throughput and IOPS - an area, where the
 7000-series with its over-the-network access would just be a totally
 useless brick).
 
 As an engineer I'm curious have you actually tried a suitably sized
 S7000 or are you assuming it won't perform suitably for you ?
 

No, I haven't tried a S7000, but I've tried other kinds of network
storage and from a design perspective, for my applications, it doesn't
even make a single bit of sense. I'm talking about high-volume real-time
video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from
a machine over UDP. Having to go over the network to fetch the data from
a different machine is kind of like building a proxy which doesn't
really do anything - if the data is available from a different machine
over the network, then why the heck should I just put another machine in
the processing path? For my applications, I need a machine with as few
processing components between the disks and network as possible, to
maximize throughput, maximize IOPS and minimize latency and jitter.

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Network video streaming [Was: Re: X4540 no next-gen product?]

2011-04-08 Thread Sašo Kiselkov
On 04/08/2011 07:22 PM, J.P. King wrote:
 
 No, I haven't tried a S7000, but I've tried other kinds of network
 storage and from a design perspective, for my applications, it doesn't
 even make a single bit of sense. I'm talking about high-volume real-time
 video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from
 a machine over UDP. Having to go over the network to fetch the data from
 a different machine is kind of like building a proxy which doesn't
 really do anything - if the data is available from a different machine
 over the network, then why the heck should I just put another machine in
 the processing path? For my applications, I need a machine with as few
 processing components between the disks and network as possible, to
 maximize throughput, maximize IOPS and minimize latency and jitter.
 
 I can't speak for this particular situation or solution, but I think in
 principle you are wrong.  Networks are fast.  Hard drives are slow.  Put
 a 10G connection between your storage and your front ends and you'll
 have the bandwidth[1].  Actually if you really were hitting 1000x8Mbits
 I'd put 2, but that is just a question of scale.  In a different
 situation I have boxes which peak at around 7 Gb/s down a 10G link (in
 reality I don't need that much because it is all about the IOPS for
 me).  That is with just twelve 15k disks.  Your situation appears to be
 pretty ideal for storage hardware, so perfectly achievable from an
 appliance.

I envision this kind of scenario (using my fancy ASCII art skills :-)):

|<=========== streaming server ===========>|
+-------+  SAS   +-------+  PCI-e  +-------+  Ethernet  +--------+
| DISKS | =====> |  RAM  | ======> |  NIC  | =========> | client |
+-------+        +-------+         +-------+            +--------+

And you are advocating for this kind of scenario:

|<=========== network storage ============>|
+-------+  SAS   +-------+  PCI-e  +-------+  Ethernet
| DISKS | =====> |  RAM  | ======> |  NIC  | =========> ...
+-------+        +-------+         +-------+

                 |<=========== streaming server ============>|
     Ethernet    +-------+  PCI-e  +-------+  PCI-e  +-------+  Ethernet  +--------+
... ===========> |  NIC  | ======> |  RAM  | ======> |  NIC  | =========> | client |
                 +-------+         +-------+         +-------+            +--------+

I'm not constrained on CPU (so hooking up multiple streaming servers to
one backend storage doesn't really make sense).
So what exactly does this scenario add to my needs, besides needing
extra hardware in both the storage and the server (10G NICs, cabling,
modules, etc.)? I'm not saying no, I'd love to improve the throughput,
IOPS and latency characteristics of my systems.

 I can't speak for the S7000 range.  I ignored that entire product line
 because when I asked about it the markup was insane compared to just
 buying X4500/X4540s.  The price for Oracle kit isn't remotely tenable, so
 the death of the X45xx range is a moot point for me anyway, since I
 couldn't afford it.
 
 [1] Just in case, you also shouldn't be adding any particularly
 significant latency either.  Jitter, maybe, depending on the specifics
 of the streams involved.
 
 Saso
 
 Julian
 -- 
 Julian King
 Computer Officer, University of Cambridge, Unix Support
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Sašo Kiselkov
On 01/07/2011 10:26 AM, Darren J Moffat wrote:
 On 06/01/2011 23:07, David Magda wrote:
 On Jan 6, 2011, at 15:57, Nicolas Williams wrote:

 Fletcher is faster than SHA-256, so I think that must be what you're
 asking about: can Fletcher+Verification be faster than
 Sha256+NoVerification?  Or do you have some other goal?

 Would running on recent T-series servers, which have have on-die
 crypto units, help any in this regard?

 The on chip SHA-256 implementation is not yet used see:

 http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

 Note that the fix I integrated only uses a software implementation of
 SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU
 hardware implementation of SHA256.  The reason for this is to do with
 boot time availability of the Solaris Cryptographic Framework and the
 need to have ZFS as the root filesystem.

 Not yet changed it turns out to be quite complicated to fix due to
 very early boot issues.

Would it be difficult to implement both methods and allow ZFS to switch
to the hardware-accelerated crypto backend at runtime after it has been
brought up and initialized? It seems like one heck of a feature
(essentially removing most of the computational complexity of dedup).

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Sašo Kiselkov
On 01/07/2011 01:15 PM, Darren J Moffat wrote:
 On 07/01/2011 11:56, Sašo Kiselkov wrote:
 On 01/07/2011 10:26 AM, Darren J Moffat wrote:
 On 06/01/2011 23:07, David Magda wrote:
 On Jan 6, 2011, at 15:57, Nicolas Williams wrote:

 Fletcher is faster than SHA-256, so I think that must be what you're
 asking about: can Fletcher+Verification be faster than
 Sha256+NoVerification?  Or do you have some other goal?

 Would running on recent T-series servers, which have have on-die
 crypto units, help any in this regard?

 The on chip SHA-256 implementation is not yet used see:

 http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

 Note that the fix I integrated only uses a software implementation of
 SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU
 hardware implementation of SHA256.  The reason for this is to do with
 boot time availability of the Solaris Cryptographic Framework and the
 need to have ZFS as the root filesystem.

 Not yet changed it turns out to be quite complicated to fix due to
 very early boot issues.

 Would it be difficult to implement both methods and allow ZFS to switch
 to the hardware-accelerated crypto backend at runtime after it has been
 brought up and initialized? It seems like one heck of a feature

 Whether it is difficult or not depends on your level of familiarity
 with ZFS, boot and the cryptographic framework ;-)

 For me no it wouldn't be difficult but it still isn't completely trivial.

 (essentially removing most of the computational complexity of dedup).

 Most of the data I've seen on the performance impact of dedup is not
 coming from the SHA256 computation it is mostly about the additional
 IO to deal with the DDT.   Though lowering the overhead that SHA256
 does add is always a good thing.

Well, seeing as all mainline ZFS development is now happening behind
closed doors, all I can really do is ask for features and hope Oracle
implements them :-). Nevertheless, thanks for the clarification.

BR,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

