Re: [zfs-discuss] Solaris derivate with the best long-term future
On 07/11/2012 01:51 PM, Eugen Leitl wrote: As a napp-it user who recently needs to upgrade from NexentaCore, I recently saw OpenIndiana live preferred, but alternatively Illumian, NexentaCore and Solaris 11 (Express), as a system recommendation for napp-it. I wonder about the future of OpenIndiana and Illumian; which fork is likely to see the most continued development, in your opinion? I use OpenIndiana personally, since it's the one I'm most familiar with (a direct continuation of the OpenSolaris tradition). If you need something with commercial support in that spirit, I recommend having a look at OmniOS. Joyent's SmartOS is really interesting, albeit a bit narrow-profile for my taste (plus, its use of NetBSD packaging means I'll have to adapt to a new way of doing things, and I like IPS very much). Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 03:39 PM, David Magda wrote: On Tue, July 10, 2012 19:56, Sašo Kiselkov wrote: However, before I start out on a pointless endeavor, I wanted to probe the field of ZFS users, especially those using dedup, on whether their workloads would benefit from a faster hash algorithm (and hence, lower CPU utilization). Developments of late have suggested to me three possible candidates: [...] I'd wait until SHA-3 is announced. It's supposed to happen this year, of which only six months are left: http://csrc.nist.gov/groups/ST/hash/timeline.html http://en.wikipedia.org/wiki/NIST_hash_function_competition It was actually supposed to happen in 2Q, so they're running a little late, it seems. I'm not convinced waiting makes much sense. The SHA-3 standardization process' goals are different from ours. SHA-3 can choose to go with something that's slower, but has a higher security margin. I think that absolute super-tight security isn't all that necessary for ZFS, since the hash isn't used for security purposes. We only need something that's fast and has a good pseudo-random output distribution. That's why I looked toward Edon-R. Even though it might have security problems in itself, it's by far the fastest algorithm in the entire competition. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 03:57 PM, Gregg Wonderly wrote: Since there is a finite number of bit patterns per block, have you tried to just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if there is ever a collision? If you found an algorithm that produced no collisions for any possible block bit pattern, wouldn't that be the win? Don't you think that, if you can come up with this procedure, the crypto security guys at universities have thought about it as well? Of course they have. No, simply generating a sequence of random patterns and hoping to hit a match won't do the trick. P.S. I really don't mean to sound smug or anything, but I know one thing for sure: the crypto researchers who propose these algorithms are some of the brightest minds on this topic on the planet, so I would hardly think they didn't consider trivial problems. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 03:58 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Sašo Kiselkov I really mean no disrespect, but this comment is so dumb I could swear my IQ dropped by a few tenths of a point just by reading it. Cool it please. You say I mean no disrespect and then say something which is clearly disrespectful. I sort of flew off the handle there, and I shouldn't have. It felt like Tomas was misrepresenting my position and putting words in my mouth I didn't say. I certainly didn't mean to diminish the validity of an honest question. Tomas's point is to illustrate that hashing is a many-to-one function. If it were possible to rely on the hash to always be unique, then you could use it as a compression algorithm. He's pointing out that's insane. His comment was not in the slightest bit dumb; if anything, it seems like maybe somebody (or some people) didn't get his point. I understood his point very well and I never argued that hashing always results in unique hash values, which is why I thought he was misrepresenting what I said. So for a full explanation of why hashes aren't usable for compression:
1) they are one-way (kind of a bummer for decompression)
2) they operate far below the Shannon limit (i.e. unusable for lossless compression)
3) their output is pseudo-random, so even if we find collisions, we have no way to distinguish which input was the most likely one meant for a given hash value (all are equally probable)
A formal proof would of course take longer to construct and would take time that I feel is best spent writing code. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
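To make the many-to-one point concrete, here is a small illustrative sketch (plain Python hashlib, nothing from ZFS; the 16-bit truncation is purely so collisions show up quickly): once the output is shorter than the input, distinct inputs inevitably share a hash value, so the hash alone cannot be "decompressed" back into a unique block.

  import hashlib, os, collections

  seen = collections.defaultdict(set)
  for _ in range(100000):
      block = os.urandom(64)                    # random 64-byte "input block"
      tag = hashlib.sha256(block).digest()[:2]  # keep only 16 bits of the hash
      seen[tag].add(block)

  collisions = sum(1 for blocks in seen.values() if len(blocks) > 1)
  print("16-bit hash values shared by more than one distinct input:", collisions)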
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:19 PM, Gregg Wonderly wrote: But this is precisely the kind of observation that some people seem to miss out on the importance of. As Tomas suggested in his post, if this was true, then we could have a huge compression ratio as well. And even if there was 10% of the bit patterns that created non-unique hashes, you could use the fact that a block hashed to a known bit pattern that didn't have collisions, to compress the other 90% of your data. I'm serious about this from a number of perspectives. We worry about the time it would take to reverse SHA or RSA hashes to passwords, not even thinking that what if someone has been quietly computing all possible hashes for the past 10-20 years into a database some where, with every 5-16 character password, and now has an instantly searchable hash-to-password database. This is something very well known in the security community as rainbow tables and a common method to protect against it is via salting. Never use a password hashing scheme which doesn't use salts for exactly the reason you outlined above. Sometimes we ignore the scale of time, thinking that only the immediately visible details are what we have to work with. If no one has computed the hashes for every single 4K and 8K block, then fine. But, if that was done, and we had that data, we'd know for sure which algorithm was going to work the best for the number of bits we are considering. Do you even realize how many 4K or 8K blocks there are?!?! Exactly 2^32768 or 2^65536 respectively. I wouldn't worry about somebody having those pre-hashed ;-) Rainbow tables only work for a very limited subset of data. Speculating based on the theory of the algorithms for random number of bits is just silly. Where's the real data that tells us what we need to know? If you don't trust math, then I there's little I can do to convince you. But remember our conversation the next time you step into a car or get on an airplane. The odds that you'll die on that ride are far higher than that you'll find a random hash collision in a 256-bit hash algorithm... -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
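For completeness, the usual shape of the salting defence, sketched in Python with hashlib (an illustration only, not anything from ZFS or from any particular login system): a per-user random salt means the same password hashes differently for every user, so a precomputed hash-to-password table buys the attacker nothing.

  import hashlib, os

  def hash_password(password: bytes):
      salt = os.urandom(16)   # per-user random salt, stored alongside the digest
      digest = hashlib.pbkdf2_hmac('sha256', password, salt, 100000)
      return salt, digest

  def check_password(password: bytes, salt: bytes, digest: bytes) -> bool:
      return hashlib.pbkdf2_hmac('sha256', password, salt, 100000) == digest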
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:22 PM, Bob Friesenhahn wrote: On Wed, 11 Jul 2012, Sašo Kiselkov wrote: the hash isn't used for security purposes. We only need something that's fast and has a good pseudo-random output distribution. That's why I looked toward Edon-R. Even though it might have security problems in itself, it's by far the fastest algorithm in the entire competition. If an algorithm is not 'secure' and zfs is not set to verify, doesn't that mean that a knowledgeable user will be able to cause intentional data corruption if deduplication is enabled? A user with very little privilege might be able to cause intentional harm by writing the magic data block before some other known block (which produces the same hash) is written. This allows one block to substitute for another. It does seem that security is important because with a human element, data is not necessarily random. Theoretically yes, it is possible, but the practicality of such an attack is very much in doubt. In case this is a concern, however, one can always switch to a more secure hash function (e.g. Skein-512). Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:23 PM, casper@oracle.com wrote: On Tue, 10 Jul 2012, Edward Ned Harvey wrote: CPU's are not getting much faster. But IO is definitely getting faster. It's best to keep ahead of that curve. It seems that per-socket CPU performance is doubling every year. That seems like faster to me. I think that I/O isn't getting as fast as CPU is; memory capacity and bandwidth and CPUs are getting faster. I/O, not so much. (Apart from the single step from hard disk to SSD; but note that I/O is limited to standard interfaces and as such it is likely to be held down by requiring a new standard.) Have you seen one of those SSDs made by FusionIO? Those things fit in a single PCI-e x8 slot and can easily push a sustained rate upward of several GB/s. Do not expect that drives are the be-all end-all to storage. Hybrid storage invalidated the traditional "CPU and memory fast, disks slow" wisdom years ago. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:27 PM, Gregg Wonderly wrote: Unfortunately, the government imagines that people are using their home computers to compute hashes and try and decrypt stuff. Look at what is happening with GPUs these days. People are hooking up 4 GPUs in their computers and getting huge performance gains. 5-6 char password space covered in a few days. 12 or so chars would take one machine a couple of years if I recall. So, if we had 20 people with that class of machine, we'd be down to a few months. I'm just suggesting that while the compute space is still huge, it's not actually undoable, it just requires some thought into how to approach the problem, and then some time to do the computations. Huge space, but still finite… There are certain physical limits which one cannot exceed. For instance, you cannot store 2^256 units of 32-byte quantities on Earth. Even if you used proton spin (or some other quantum property) to store a bit, there simply aren't enough protons in the entire visible universe to do it. You will never ever be able to search a 256-bit memory space using a simple exhaustive search. The reason why our security hashes are so long (256 bits, 512 bits, more...) is that attackers *don't* do an exhaustive search. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:30 PM, Gregg Wonderly wrote: This is exactly the issue for me. It's vital to always have verify on. If you don't have the data to prove that every possible block combination hashes uniquely for the small bit space we are talking about, then how in the world can you say that verify is not necessary? That just seems ridiculous to propose. Do you need assurances that in the next 5 seconds a meteorite won't fall to Earth and crush you? No. And yet, the Earth puts on thousands of tons of weight each year from meteoric bombardment, and people have been hit and killed by them (not to speak of mass extinction events). Nobody has ever demonstrated the ability to produce a hash collision in any suitably long hash (128 bits plus) using a random search. All hash collisions have been found by attacking the weaknesses in the mathematical definition of these functions (i.e. some part of the input didn't get obfuscated well in the hash function machinery and spilled over into the result, resulting in a slight, but usable non-randomness). Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:36 PM, Justin Stringfellow wrote: Since there is a finite number of bit patterns per block, have you tried to just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if there is ever a collision? If you found an algorithm that produced no collisions for any possible block bit pattern, wouldn't that be the win? Perhaps I've missed something, but if there was *never* a collision, you'd have stumbled across a rather impressive lossless compression algorithm. I'm pretty sure there's some Big Mathematical Rules (Shannon?) that mean this cannot be. Do you realize how big your lookup dictionary would have to be? -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:39 PM, Ferenc-Levente Juhos wrote: As I said several times before, to produce hash collisions. Or to calculate rainbow tables (as a previous poster theorized) you only need the following. You don't need to reproduce all possible blocks. 1. SHA256 produces a 256 bit hash 2. That means it produces a value on 256 bits, in other words a value between 0..2^256 - 1 3. If you start counting from 0 to 2^256 and for each number calculate the SHA256 you will get at least one hash collision (if the hash algorithm is perfectly distributed) 4. Counting from 0 to 2^256 is nothing else but reproducing all possible bit patterns on 32 bytes It's not about whether one computer is capable of producing the above hashes or not, or whether there are actually that many unique 32 byte bit patterns in the universe. A collision can happen. It's actually not that simple, because in hash collision attacks you're not always afforded the luxury of being able to define your input block. More often than not, you want to modify a previously hashed block in such a fashion that it carries your intended modifications while hashing to the same original value. Say for instance you want to modify a 512-byte message (e.g. an SSL certificate) to point to your own CN. Your rainbow table, even if you could store it somewhere (you couldn't, btw), would do you little good here. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
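The difference between finding some arbitrary colliding pair and matching one specific, already-fixed hash is easy to see on a toy scale. A rough sketch (Python, with SHA-256 truncated to 16 bits purely so both searches finish instantly; with a real 256-bit hash both are hopeless, but the gap between them is the point):

  import hashlib, os

  def h16(data):
      return hashlib.sha256(data).digest()[:2]   # 16-bit toy hash

  # Birthday-style search: any two inputs that collide (expected ~2^8 tries).
  seen = {}
  tries_any = 0
  while True:
      tries_any += 1
      block = os.urandom(16)
      tag = h16(block)
      if tag in seen and seen[tag] != block:
          break
      seen[tag] = block

  # Targeted search: match one fixed, pre-existing hash (expected ~2^16 tries).
  target = h16(b"some well-known block")
  tries_target = 0
  while True:
      tries_target += 1
      if h16(os.urandom(16)) == target:
          break

  print("arbitrary collision after", tries_any, "tries")
  print("fixed-target match after", tries_target, "tries")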
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:54 PM, Ferenc-Levente Juhos wrote: You don't have to store all hash values: a. Just memorize the first one, SHA256(0) b. start counting c. bang: by the time you get to 2^256 you get at least a collision. Just one question: how long do you expect this is going to take on average? Come on, do the math! -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
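For anyone who actually wants to do that math, a back-of-the-envelope sketch (Python; the 10^12 hashes per second rate is an assumed and very generous figure, not a measurement from this thread):

  # Counting until SHA-256 repeats: a birthday-bound estimate says you need
  # roughly sqrt(2^256) = 2^128 hashes before the first collision is expected.
  rate = 10**12                      # assumed hashes per second (very generous)
  expected_hashes = 2**128           # birthday bound for a 256-bit hash
  seconds = expected_hashes / rate
  years = seconds / (3600 * 24 * 365)
  print("expected time to first collision: %.1e years" % years)
  # -> about 1e19 years, i.e. roughly a billion times the age of the universe.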
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 04:56 PM, Gregg Wonderly wrote: So, if I had a block collision on my ZFS pool that used dedup, and it had my bank balance of $3,212.20 on it, and you tried to write your bank balance of $3,292,218.84 and got the same hash, no verify, and thus you got my block/balance and now your bank balance was reduced by 3 orders of magnitude, would you be okay with that? What assurances would you be content with using my ZFS pool? I'd feel entirely safe. There, I said it. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 05:10 PM, David Magda wrote: On Wed, July 11, 2012 09:45, Sašo Kiselkov wrote: I'm not convinced waiting makes much sense. The SHA-3 standardization process' goals are different from ours. SHA-3 can choose to go with something that's slower, but has a higher security margin. I think that absolute super-tight security isn't all that necessary for ZFS, since the hash isn't used for security purposes. We only need something that's fast and has a good pseudo-random output distribution. That's why I looked toward Edon-R. Even though it might have security problems in itself, it's by far the fastest algorithm in the entire competition. Fair enough, though I think eventually the SHA-3 winner will be incorporated into hardware (or at least certain instructions used in the algorithm will). I think waiting a few more weeks/months shouldn't be a big deal, as the winner should be announced Real Soon Now, and then a more informed decision can probably be made. The AES process winner had been announced in October 2000. Considering AES-NI was proposed in March 2008 and first silicon for it appeared around January 2010, I wouldn't hold my breath hoping for hardware SHA-3-specific acceleration getting a widespread foothold for at least another 5-10 years (around 2-3 technology generations). That being said, a lot can be achieved using SIMD instructions, but that doesn't depend on the SHA-3 process in any way. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 05:33 PM, Bob Friesenhahn wrote: On Wed, 11 Jul 2012, Sašo Kiselkov wrote: The reason why I don't think this can be used to implement a practical attack is that in order to generate a collision, you first have to know the disk block that you want to create a collision on (or at least the checksum), i.e. the original block is already in the pool. At that point, you could write a colliding block which would get de-dup'd, but that doesn't mean you've corrupted the original data, only that you referenced it. So, in a sense, you haven't corrupted the original block, only your own collision block (since that's the copy that doesn't get written). This is not correct. If you know the well-known block to be written, then you can arrange to write your collision block prior to when the well-known block is written. Therefore, it is imperative that the hash algorithm make it clearly impractical to take a well-known block and compute a collision block. For example, the well-known block might be part of a Windows anti-virus package, or a Windows firewall configuration, and corrupting it might leave a Windows VM open to malware attack. True, but that may not be enough to produce a practical collision, for the reason that while you know which bytes you want to attack, these might not line up with ZFS disk blocks (especially the case with Windows VMs, which are stored in large opaque zvols) - such an attack would effectively require physical access to the machine (at which point you can simply manipulate the blocks directly). -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 05:58 PM, Gregg Wonderly wrote: You're entirely sure that there could never be two different blocks that can hash to the same value and have different content? Wow, can you just send me the cash now and we'll call it even? You're the one making the positive claim and I'm calling bullshit. So the onus is on you to demonstrate the collision (and that you arrived at it via your brute force method as described). Until then, my money stays safely in my bank account. Put up or shut up, as the old saying goes. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 06:23 PM, Gregg Wonderly wrote: What I'm saying is that I am getting conflicting information from your rebuttals here. Well, let's address that then: I (and others) say there will be collisions that will cause data loss if verify is off. Saying that there will be, without any supporting evidence to back it up, amounts to a prophecy. You say it would be so rare as to be impossible from your perspective. Correct. Tomas says, well then let's just use the hash value for a 4096X compression. You fluff around his argument calling him names. Tomas' argument was, as I understood later, an attempt at sarcasm. Nevertheless, I later explained exactly why I consider the hash-compression claim total and utter bunk: So for a full explanation of why hashes aren't usable for compression: 1) they are one-way (kind of a bummer for decompression) 2) they operate far below the Shannon limit (i.e. unusable for lossless compression) 3) their output is pseudo-random, so even if we find collisions, we have no way to distinguish which input was the most likely one meant for a given hash value (all are equally probable) I say, well then compute all the possible hashes for all possible bit patterns and demonstrate no dupes. This assumes it's possible to do so. Ferenc made a similar claim and I responded with this question: how long do you expect this is going to take on average? Come on, do the math! I pose the same to you. Find the answer and you'll understand exactly why what you're proposing is impossible. You say it's not possible to do that. Please go on and compute a reduced size of the problem for, say, 2^64 32-byte values (still a laughably small space for the problem, but I'm feeling generous). Here's the amount of storage you'll need: 2^64 * 32 bytes = 2^69 bytes, i.e. roughly 512 exabytes. And that's for a problem that I've reduced for you by a factor of 2^192 (nearly 58 orders of magnitude). You see, only when you do the math do you realize how off base you are in claiming that pre-computation of hash rainbow tables for generic bit patterns is doable. I illustrate a way that loss of data could cost you money. That's merely an emotional argument, in which you try to invoke an emotional response by putting my ass on the line. Sorry, that doesn't invalidate the original argument that you can't do rainbow table pre-computation for long bit patterns. You say it's impossible for there to be a chance of me constructing a block that has the same hash but different content. To make sure we're not using ambiguous rhetoric here, allow me to summarize my position: you cannot produce, in practical terms, a hash collision on a 256-bit secure hash algorithm using a brute-force algorithm. Several people have illustrated that 128K to 32 bits is a huge and lossy ratio of compression, yet you still say it's viable to leave verify off. Except that we're not talking 128K to 32 bits, but 128K to 256 bits. Also, only once you appreciate the mathematics behind the size of the 256-bit pattern space can you understand why leaving verify off is okay. I say, in fact, that the total number of unique patterns that can exist on any pool is small, compared to the total, illustrating that I understand how the key space for the algorithm is small when looking at a ZFS pool, and thus could have a non-collision opportunity. This is so profoundly wrong that it leads me to suspect you never took courses on cryptography and/or information theory. The size of your storage pool DOESN'T MATTER ONE BIT to the size of the key space.
Even if your pool were the size of a single block, we're talking here about the *mathematical* possibility of hitting on a random block that hashes to the same value. Given a stream of random data blocks (thus simulating an exhaustive brute-force search) and a secure pseudo-random hash function (which has a roughly equal chance of producing any output value for a given input block), you've got only a 10^-77 chance of getting a hash collision. If you don't understand how this works, read a book on digital coding theory. So I can see what perspective you are drawing your confidence from, but I, and others, are not confident that the risk has zero probability. I never said the risk is zero. The risk is non-zero, but it is so close to zero that you may safely ignore it (since we take much greater risks on a daily basis without so much as a blink of an eye). I'm pushing you to find a way to demonstrate that there is zero risk because if you do that, then you've, in fact, created the ultimate compression factor to date for random bit patterns (but enlarged the keys that could collide, because the pool is now virtually larger), and you've also demonstrated that the particular algorithm is very good for dedup. That would indicate to me that you can then take that algorithm and run it inside of ZFS dedup to automatically manage when verify is necessary, by detecting when a collision occurs. Do
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 10:06 PM, Bill Sommerfeld wrote: On 07/11/12 02:10, Sašo Kiselkov wrote: Oh jeez, I can't remember how many times this flame war has been going on on this list. Here's the gist: SHA-256 (or any good hash) produces a near uniform random distribution of output. Thus, the chances of getting a random hash collision are around 2^-256 or around 10^-77. I think you're correct that most users don't need to worry about this -- sha-256 dedup without verification is not going to cause trouble for them. But your analysis is off. You're citing the chance that two blocks picked at random will have the same hash. But that's not what dedup does; it compares the hash of a new block to a possibly-large population of other hashes, and that gets you into the realm of the birthday problem or birthday paradox. See http://en.wikipedia.org/wiki/Birthday_problem for formulas. So, maybe somewhere between 10^-50 and 10^-55 for there being at least one collision in really large collections of data - still not likely enough to worry about. Yeah, I know, I did this as a quick first-degree approximation. However, the provided range is still vanishingly small compared to the chance of getting a random bit-rot error that even Fletcher won't catch. Of course, that assumption goes out the window if you're concerned that an adversary may develop practical ways to find collisions in sha-256 within the deployment lifetime of a system. sha-256 is, more or less, a scaled-up sha-1, and sha-1 is known to be weaker than the ideal 2^80 strength you'd expect from its 160 bits of hash; the best credible attack is somewhere around 2^57.5 (see http://en.wikipedia.org/wiki/SHA-1#SHA-1). Of course, this is theoretically possible, however, I do not expect such an attack to be practical within any reasonable time frame of the deployment. In any case, should a realistic need to solve this arise, we can always simply switch hashes (I'm also planning to implement Skein-512/256) and do a send/recv to rewrite everything on disk. PITA? Yes. Serious problem? Don't think so. On a somewhat less serious note, perhaps zfs dedup should contain Chinese lottery code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like... How about we ask them to report to me instead, like so: 1) Detect collision 2) Report to me 3) ??? 4) Profit! Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
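The birthday-problem arithmetic is simple enough to sketch (Python; the 1 EB pool size and the 128K record size are example assumptions, not figures from this thread):

  # Probability of at least one SHA-256 collision among n blocks is
  # approximately n*(n-1) / (2 * 2^256) for n much smaller than 2^128.
  pool_bytes = 10**18                      # assume a 1 EB pool
  block_size = 128 * 1024                  # assume 128K records
  n = pool_bytes // block_size             # ~7.6e12 blocks
  p = n * (n - 1) / (2 * 2**256)
  print("blocks: %e  collision probability: %e" % (n, p))
  # -> on the order of 1e-52, consistent with the 10^-50..10^-55 range above.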
[zfs-discuss] New fast hash algorithm - is it needed?
Hi guys, I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS implementation to supplant the currently utilized sha256. On modern 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much slower than many of the SHA-3 candidates, so I went out and did some testing (details attached) on a possible new hash algorithm that might improve on this situation. However, before I start out on a pointless endeavor, I wanted to probe the field of ZFS users, especially those using dedup, on whether their workloads would benefit from a faster hash algorithm (and hence, lower CPU utilization). Developments of late have suggested to me three possible candidates:
* SHA-512: simplest to implement (since the code is already in the kernel) and provides a modest performance boost of around 60%.
* Skein-512: overall fastest of the SHA-3 finalists and much faster than SHA-512 (around 120-150% faster than the current sha256).
* Edon-R-512: probably the fastest general-purpose hash algorithm I've ever seen (upward of 300% speedup over sha256), but might have potential security problems (though I don't think this is of any relevance to ZFS, as it doesn't use the hash for any kind of security purposes, but only for data integrity and dedup).
My testing procedure: nothing sophisticated, I took the implementation of sha256 from the Illumos kernel and simply ran it on a dedicated psrset (where possible with a whole CPU dedicated, even if only to a single thread) - I tested both the generic C implementation and the Intel assembly implementation. The Skein and Edon-R implementations are in C optimized for 64-bit architectures from the respective authors (the most up to date versions I could find). All code has been compiled using GCC 3.4.3 from the repos (the same that can be used for building Illumos). Sadly, I don't have access to Sun Studio.
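For anyone who wants a quick sanity check of the SHA-256 vs SHA-512 gap without building the kernel bits, a rough userland sketch (Python's hashlib on a RAM buffer; this is obviously not the psrset-pinned kernel harness behind the numbers below, just an illustration of the 64-bit effect):

  import hashlib, time

  buf = b"\0" * (256 * 1024 * 1024)        # 256 MB test buffer
  for name in ("sha256", "sha512"):
      h = hashlib.new(name)
      start = time.perf_counter()
      h.update(buf)
      h.digest()
      elapsed = time.perf_counter() - start
      print("%s: %.1f MB/s" % (name, len(buf) / elapsed / 1e6))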
Cheers, -- Saso

Hash performances on 10 GB of data
gcc (GCC) 3.4.3 (csl-sol210-3_4-20050802)
CFLAGS: -O3 -fomit-frame-pointer -m64

MACHINE #1
CPU: dual AMD Opteron 4234
Options: single thread on no-intr whole-CPU psrset
Algorithm        Result               Improvement
sha256 (ASM)     21.19 cycles/byte    (baseline)
sha256 (C)       27.66 cycles/byte    -23.34%
sha512 (ASM)     13.48 cycles/byte    57.20%
sha512 (C)       17.35 cycles/byte    22.13%
Skein-512 (C)     8.95 cycles/byte    136.76%
Edon-R-512 (C)    4.94 cycles/byte    328.94%

MACHINE #2
CPU: single AMD Athlon II Neo N36L
Options: single thread on no-intr 1-core psrset
Algorithm        Result               Improvement
sha256 (ASM)     15.68 cycles/byte    (baseline)
sha256 (C)       18.81 cycles/byte    -16.64%
sha512 (ASM)      9.95 cycles/byte    57.59%
sha512 (C)       11.84 cycles/byte    32.43%
Skein-512 (C)     6.25 cycles/byte    150.88%
Edon-R-512 (C)    3.66 cycles/byte    328.42%

MACHINE #3
CPU: dual Intel Xeon E5645
Options: single thread on no-intr whole-CPU psrset
Algorithm        Result               Improvement
sha256 (ASM)     15.49 cycles/byte    (baseline)
sha256 (C)       17.90 cycles/byte    -13.46%
sha512 (ASM)      9.88 cycles/byte    56.78%
sha512 (C)       11.44 cycles/byte    35.40%
Skein-512 (C)     6.88 cycles/byte    125.15%
Edon-R-512 (C)    3.35 cycles/byte    362.39%

MACHINE #4
CPU: single Intel Xeon E5405
Options: single thread on no-intr 1-core psrset
Algorithm        Result               Improvement
sha256 (ASM)     17.45 cycles/byte    (baseline)
sha256 (C)       18.34 cycles/byte    -4.85%
sha512 (ASM)     10.24 cycles/byte    70.41%
sha512 (C)       11.72 cycles/byte    48.90%
Skein-512 (C)     7.32 cycles/byte    138.39%
Edon-R-512 (C)    3.86 cycles/byte    352.07%

MACHINE #5
CPU: dual Intel Xeon E5450
Options: single thread on no-intr whole-CPU psrset
Algorithm        Result               Improvement
sha256 (ASM)     16.43 cycles/byte    (baseline)
sha256 (C)       18.50 cycles/byte    -11.19%
sha512 (ASM)     10.37 cycles/byte    58.44%
sha512 (C)       11.85 cycles/byte    38.65%
Skein-512 (C)     7.38 cycles/byte    122.63%
Edon-R-512 (C)    3.88 cycles/byte    323.45%

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/19/2012 11:05 AM, Sašo Kiselkov wrote: On 06/18/2012 07:50 PM, Roch wrote: Are we hitting : 7167903 Configuring VLANs results in single threaded soft ring fanout Confirmed, it is definitely this. Hold the phone, I just tried unconfiguring all of the VLANs in the system and went to pure interfaces and it didn't help. So while the issue stems from the soft ring fanout, it's probably not caused by VLANs. Thanks for the pointers anyway, though. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/18/2012 12:05 AM, Richard Elling wrote: You might try some of the troubleshooting techniques described in Chapter 5 of the DTrace book by Brendan Gregg and Jim Mauro. It is not clear from your description that you are seeing the same symptoms, but the technique should apply. -- richard Thanks for the advice, I'll try it. In the meantime, I'm beginning to suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at:

  # mdb -k ::interrupts
  IRQ  Vect IPL Bus Trg Type CPU Share APIC/INT# ISR(s)
  [snip]
  91   0x82 7   PCI Edg MSI  5   1     -         pcieb_intr_handler
  [snip]

In mpstat I can see that during normal operation, CPU 5 is nearly floored:

  CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  50 00 5120 105400 8700 00 93 0 7

Then, when anything hits which disturbs the PCI-e bus (e.g. a txg flush or the xcall storm), the CPU goes to 100% utilization and my networking throughput drops accordingly. The issue can be softened by lowering the input bandwidth from ~46MB/s to below 20MB/s - at that point I'm getting only about 10% utilization on the core in question and no xcall storm or txg flush can influence my network (though I do see the CPU get about 70% busy during the process, but still enough left to avoid packet loss). So it seems I'm hitting some hardware design issue, or something... I'll try moving my network card to the second PCI-e I/O bridge tomorrow (which seems to be bound to CPU 6). Any other ideas on what I might try to get the PCI-e I/O bridge bandwidth back? Or how to fight the starvation of the CPU by other activities in the system (xcalls and/or txg flushes)? I already tried putting the CPUs in question into an empty processor set, but that isn't enough, it seems. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/13/2012 03:43 PM, Roch wrote: Sašo Kiselkov writes: On 06/12/2012 05:37 PM, Roch Bourbonnais wrote: So the xcalls are a necessary part of memory reclaiming, when one needs to tear down the TLB entry mapping the physical memory (which can from here on be repurposed). So the xcalls are just part of this. Should not cause trouble, but they do. They consume a cpu for some time. That in turn can cause infrequent latency bubbles on the network. A certain root cause of these latency bubbles is that network threads are bound by default, and if the xcall storm ends up on the CPU that the network thread is bound to, it will wait for the storm to pass. I understand, but the xcall storm only eats up a single core out of a total of 32, plus it's not a single specific one, it tends to change, so what are the odds of hitting the same core as the one on which the mac thread is running? That's easy :-) : 1/32 each time it needs to run. So depending on how often it runs (which depends on how much churn there is in the ARC) and how often you see the latency bubbles, that may or may not be it. What is zio_taskq_batch_pct on your system? That is another storm bit of code which causes bubbles. Setting it down to 50 (versus an older default of 100) should help if it's not done already. -r So I tried all of the suggestions above (mac unbinding, zio_taskq tuning) and none helped. I'm beginning to suspect it has something to do with the networking cards. When I try and snoop filtered traffic from one interface into a file (snoop -o /tmp/dump -rd vlan935 host a.b.c.d), my multicast reception throughput plummets to about 1/3 of the original. I'm running a link-aggregation of 4 on-board Broadcom NICs:

  # dladm show-aggr -x
  LINK    PORT  SPEED  DUPLEX  STATE  ADDRESS            PORTSTATE
  aggr0   --    1000Mb full    up     d0:67:e5:fc:bd:38  --
          bnx1  1000Mb full    up     d0:67:e5:fc:bd:38  attached
          bnx2  1000Mb full    up     d0:67:e5:fc:bd:3a  attached
          bnx3  1000Mb full    up     d0:67:e5:fc:bd:3c  attached
          bnx0  1000Mb full    up     d0:67:e5:fc:bd:36  attached

  # dladm show-vlan
  LINK     VID  OVER   FLAGS
  vlan49   49   aggr0  -
  vlan934  934  aggr0  -
  vlan935  935  aggr0  -

Normally, I'm getting around 46MB/s on vlan935, however, once I run any snoop command which puts the network interfaces into promisc mode, my throughput plummets to around 20MB/s. During that I can see context switches skyrocket on 4 CPU cores and them being around 75% busy. Now I understand that snoop has some probe effect, but this is definitely too large. I've never seen this kind of bad behavior before on any of my other Solaris systems (with similar load). Are there any tunings I can make to my network to track down the issue? My module for bnx is:

  # modinfo | grep bnx
  169 f80a7000 63ba0 197 1 bnx (Broadcom NXII GbE 6.0.1)

Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks
On 06/15/2012 03:35 PM, Johannes Totz wrote: On 15/06/2012 13:22, Sašo Kiselkov wrote: On 06/15/2012 02:14 PM, Hans J Albertsson wrote: I've got my root pool on a mirror on 2 512 byte blocksize disks. I want to move the root pool to two 2 TB disks with 4k blocks. The server only has room for two disks. I do have an esata connector, though, and a suitable external cabinet for connecting one extra disk. How would I go about migrating/expanding the root pool to the larger disks so I can then use the larger disks for booting? I have no extra machine to use. Suppose we call the disks like so: A, B: your old 512-block drives X, Y: your new 2TB drives The easiest way would be to simply: 1) zpool set autoexpand=on rpool 2) offline the A drive 3) physically replace it with the X drive 4) do a zpool replace on it and wait for it to resilver When sector size differs, attaching it is going to fail (at least on fbsd). You might not get around a send-receive cycle... Jim Klimov has already posted a way better guide, which rebuilds the pool using the old one's data, so yeah, the replace route I recommended here is rendered moot. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
Seems the problem is somewhat more egregious than I thought. The xcall storm causes my network drivers to stop receiving IP multicast packets and subsequently my recording applications record bad data, so ultimately, this kind of isn't workable... I need to somehow resolve this... I'm running four on-board Broadcom NICs in an LACP aggregation. Any ideas on why this might be a side-effect? I'm really kind of out of ideas here... Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 03:57 PM, Sašo Kiselkov wrote: Seems the problem is somewhat more egregious than I thought. The xcall storm causes my network drivers to stop receiving IP multicast packets and subsequently my recording applications record bad data, so ultimately, this kind of isn't workable... I need to somehow resolve this... I'm running four on-board Broadcom NICs in an LACP aggregation. Any ideas on why this might be a side-effect? I'm really kind of out of ideas here... Cheers, -- Saso Just as another datapoint, though I'm not sure if it's going to be much use, I found (via arcstat.pl) that the storms always start to happen when ARC downsizing starts. E.g. I would see the following in ./arcstat.pl 1:

  Time      read  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
  16:29:45    21     0    0     0    0     0    0   111G  111G
  16:29:46     0     0    0     0    0     0    0   111G  111G
  16:29:47     1     0    0     0    0     0    0   111G  111G
  16:29:48     0     0    0     0    0     0    0   111G  111G
  16:29:49    5K     0    0     0    0     0    0   111G  111G   (this is where the problem starts)
  16:29:50    36     0    0     0    0     0    0   109G  107G
  16:29:51    51     0    0     0    0     0    0   107G  107G
  16:29:52    10     0    0     0    0     0    0   107G  107G
  16:29:53   148     0    0     0    0     0    0   107G  107G
  16:29:54    5K     0    0     0    0     0    0   107G  107G   (and after a while, around 10-15 seconds, it stops)

(I omitted the miss and miss% columns to make the rows fit). During that time, the network stack is dropping input IP multicast UDP packets like crazy, so I see my network input drop by about 30-40%. Truly strange behavior... Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 05:21 PM, Matt Breitbach wrote: I saw this _exact_ problem after I bumped RAM from 48GB to 192GB. Low memory pressure seemed to be the culprit. Happened usually during storage vmotions or something like that which effectively nullified the data in the ARC (sometimes 50GB of data would be purged from the ARC). The system was so busy that it would drop 10Gbit LACP portchannels from our Nexus 5k stack. I never got a good solution to this other than to set arc_min_c to something that was close to what I wanted the system to use - I settled on setting it at ~160GB. It still dropped the arcsz, but it didn't try to adjust arc_c and resulted in significantly fewer xcalls. Hmm, how do I do that? I don't have that kind of symbol in the kernel. I'm running OpenIndiana build 151a. My system indeed runs at low memory pressure; I'm simply running a bunch of writers writing files linearly with data they receive over IP/UDP multicast sockets. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 05:37 PM, Roch Bourbonnais wrote: So the xcalls are a necessary part of memory reclaiming, when one needs to tear down the TLB entry mapping the physical memory (which can from here on be repurposed). So the xcalls are just part of this. Should not cause trouble, but they do. They consume a cpu for some time. That in turn can cause infrequent latency bubbles on the network. A certain root cause of these latency bubbles is that network threads are bound by default, and if the xcall storm ends up on the CPU that the network thread is bound to, it will wait for the storm to pass. I understand, but the xcall storm only eats up a single core out of a total of 32, plus it's not a single specific one, it tends to change, so what are the odds of hitting the same core as the one on which the mac thread is running? So try unbinding the mac threads; it may help you here. How do I do that? All I can find on interrupt fencing and the like is to simply set certain processors to no-intr, which moves all of the interrupts away, and it doesn't prevent the xcall storm from choosing to affect these CPUs either... -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 06:06 PM, Jim Mauro wrote: So try unbinding the mac threads; it may help you here. How do I do that? All I can find on interrupt fencing and the like is to simply set certain processors to no-intr, which moves all of the interrupts and it doesn't prevent the xcall storm choosing to affect these CPUs either… In /etc/system: set mac:mac_soft_ring_thread_bind=0 set mac:mac_srs_thread_bind=0 Reboot required. Verify after reboot with mdb; echo mac_soft_ring_thread_bind/D | mdb -k Trying that right now... thanks! -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote: find where your nics are bound to (mdb -k ::interrupts), create a processor set including those cpus [ so just the nic code will run there ] andy Tried and didn't help, unfortunately. I'm still seeing drops. What's even funnier is that I'm seeing drops when the machine is sync'ing the txg to the zpool. So looking at a little UDP receiver I can see the following input stream bandwidth (the stream is constant bitrate, so this shouldn't happen):

  4.396151 Mbit/s - drop
  5.217205 Mbit/s
  5.144323 Mbit/s
  5.150227 Mbit/s
  5.144150 Mbit/s
  4.663824 Mbit/s - drop
  5.178603 Mbit/s
  5.148681 Mbit/s
  5.153835 Mbit/s
  5.141116 Mbit/s
  4.532479 Mbit/s - drop
  5.197381 Mbit/s
  5.158436 Mbit/s
  5.141881 Mbit/s
  5.145433 Mbit/s
  4.605852 Mbit/s - drop
  5.183006 Mbit/s
  5.150526 Mbit/s
  5.149324 Mbit/s
  5.142306 Mbit/s
  4.749443 Mbit/s - drop

(txg timeout on my system is the default 5s) It isn't just a slight delay in the arrival of the packets, because then I should be seeing a rebound on the bitrate, sort of like this:

  [ASCII sketch: bitrate-vs-time curve with dips, each dip followed by a rebound spike above the baseline]

Instead, what I'm seeing is simply:

  [ASCII sketch: bitrate-vs-time curve with dips and no rebound afterward]

(The missing spikes after the drops mean that there were lost packets on the NIC.) -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/12/2012 07:19 PM, Roch Bourbonnais wrote: Try with this /etc/system tunings : set mac:mac_soft_ring_thread_bind=0 set mac:mac_srs_thread_bind=0 set zfs:zio_taskq_batch_pct=50 Thanks for the recommendations, I'll try and see whether it helps, but this is going to take me a while (especially since the reboot means I'll have a clear ARC and need to record up again around 120G of data, which takes a while to accumulate). Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two LSI 9200 controllers and MPxIO) running OpenIndiana 151a4, and I'm occasionally seeing a storm of xcalls on one of the 32 VCPUs (on the order of 100k xcalls a second). The machine is pretty much idle, only receiving a bunch of multicast video streams and dumping them to the drives (at a rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of xcalls that completely eats one of the CPUs, so the mpstat line for the CPU looks like:

  CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  310 0 102191 1000000 00 100 0 0

That's 100% busy in the system, processing cross-calls. When I tried dtracing this issue, I found that this is the most likely culprit:

  dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'
    unix`xc_call+0x46
    unix`hat_tlb_inval+0x283
    unix`x86pte_inval+0xaa
    unix`hat_pte_unmap+0xed
    unix`hat_unload_callback+0x193
    unix`hat_unload+0x41
    unix`segkmem_free_vn+0x6f
    unix`segkmem_zio_free+0x27
    genunix`vmem_xfree+0x104
    genunix`vmem_free+0x29
    genunix`kmem_slab_destroy+0x87
    genunix`kmem_slab_free+0x2bb
    genunix`kmem_magazine_destroy+0x39a
    genunix`kmem_depot_ws_reap+0x66
    genunix`taskq_thread+0x285
    unix`thread_start+0x8
    3221701

This happens in the sched (pid 0) process. My fsstat output looks like this:

  # fsstat /content 1
   new  name  name  attr  attr lookup rddir  read read  write write
  file remov  chng   get   set    ops   ops   ops bytes   ops bytes
     0     0     0   664     0    952     0     0     0   664 38.0M /content
     0     0     0   658     0    935     0     0     0   656 38.6M /content
     0     0     0   660     0    946     0     0     0   659 37.8M /content
     0     0     0   677     0    969     0     0     0   676 38.5M /content

What's even more puzzling is that this happens apparently entirely because of some factor other than userland, since I see no changes to CPU usage of processes in prstat(1M) when this xcall storm happens, only an increase of loadavg of +1.00 (the busy CPU). I Googled and found that http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html seems to have been an issue identical to mine, however, it remained unresolved at that time and it worries me about putting this kind of machine into production use. Could some ZFS guru please tell me what's going on in segkmem_zio_free? When I disable the writers to the /content filesystem, this issue goes away, so it obviously has something to do with disk I/O. Thanks! Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
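A small watcher along these lines can help correlate the storms with other system events such as txg syncs or ARC shrinks. A rough sketch in Python that just parses mpstat output; the 50000/sec threshold and the assumption that xcal is the fourth column of mpstat's output are assumptions, not anything from this thread:

  import subprocess

  THRESHOLD = 50000   # xcalls/sec that we will call "a storm" (arbitrary)

  proc = subprocess.Popen(["mpstat", "1"], stdout=subprocess.PIPE, text=True)
  for line in proc.stdout:
      fields = line.split()
      if not fields or not fields[0].isdigit():
          continue                          # skip the header lines
      cpu, xcal = fields[0], int(fields[3])
      if xcal > THRESHOLD:
          print("xcall storm: CPU %s handled %d xcalls/sec" % (cpu, xcal))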
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/06/2012 04:55 PM, Richard Elling wrote: On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote: So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two LSI 9200 controllers and MPxIO) running OpenIndiana 151a4 and I'm occasionally seeing a storm of xcalls on one of the 32 VCPUs (10 xcalls a second). That isn't much of a storm, I've seen 1M xcalls in some cases... Well it does make one of the cores 100% busy for around 10-15 seconds, so it is processing at the maximum rate the core can do it. I'd call that a sign of something bad(tm) going on. The machine is pretty much idle, only receiving a bunch of multicast video streams and dumping them to the drives (at a rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of xcalls that completely eat one of the CPUs, so the mpstat line for the CPU looks like: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl 310 0 102191 1000000 00 100 0 0 100% busy in the system processing cross-calls. When I tried dtracing this issue, I found that this is the most likely culprit: dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}' unix`xc_call+0x46 unix`hat_tlb_inval+0x283 unix`x86pte_inval+0xaa unix`hat_pte_unmap+0xed unix`hat_unload_callback+0x193 unix`hat_unload+0x41 unix`segkmem_free_vn+0x6f unix`segkmem_zio_free+0x27 genunix`vmem_xfree+0x104 genunix`vmem_free+0x29 genunix`kmem_slab_destroy+0x87 genunix`kmem_slab_free+0x2bb genunix`kmem_magazine_destroy+0x39a genunix`kmem_depot_ws_reap+0x66 genunix`taskq_thread+0x285 unix`thread_start+0x8 3221701 This happens in the sched (pid 0) process. My fsstat one looks like this: # fsstat /content 1 new name name attr attr lookup rddir read read write write file remov chng get setops ops ops bytes ops bytes 0 0 0 664 0952 0 0 0 664 38.0M /content 0 0 0 658 0935 0 0 0 656 38.6M /content 0 0 0 660 0946 0 0 0 659 37.8M /content 0 0 0 677 0969 0 0 0 676 38.5M /content What's even more puzzling is that this happens apparently entirely because of some factor other than userland, since I see no changes to CPU usage of processes in prstat(1M) when this xcall storm happens, only an increase of loadavg of +1.00 (the busy CPU). What exactly is the workload doing? As I wrote above, just receiving multicast video streams and writing them to disk files, nothing else. The fsstat lines above show that - pure write load. Local I/O, iSCSI, NFS, or CIFS? Purely local I/O via the two LSI SAS controllers, nothing else. I Googled and found that http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html seems to have been an issue identical to mine, however, it remains unresolved at that time and it worries me about putting this kind of machine into production use. Could some ZFS guru please tell me what's going on in segkmem_zio_free? It is freeing memory. Yes, but why is this causing a ton of cross-calls? When I disable the writers to the /content filesystem, this issue goes away, so it has obviously something to do with disk IO. Thanks! Not directly related to disk I/O bandwidth. Can be directly related to other use, such as deletions -- something that causes frees. When I'm not writing to disk it doesn't happen, so my guess that it indeed has something to do with (perhaps) ZFS freeing txg buffers or something... Depending on the cause, there can be some tuning that applies for large memory machines, where large is = 96 MB. 
-- richard I'll try and load the machine with dd(1) to the max to see if access patterns of my software have something to do with it. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/06/2012 05:01 PM, Sašo Kiselkov wrote: I'll try and load the machine with dd(1) to the max to see if access patterns of my software have something to do with it. Tried and tested: any and all write I/O to the pool causes this xcall storm issue, and writing more data to it only exacerbates it (i.e. it occurs more often). I still get storms of over 100k xcalls completely draining one CPU core, but now they happen in 20-30s intervals rather than every 1-2 minutes. Writing to the rpool, however, does not, so I suspect it has something to do with the MPxIO and how ZFS is pumping data into the twin LSI 9200 controllers. Each is attached to a different CPU I/O bridge (since the system has two Opterons, it has two I/O bridges, each handling roughly half of the PCI-e links). I did this in the hope of improving performance (since the HT links to the I/O bridges will be more evenly loaded). Any idea whether this might be the cause of this issue? The whole system diagram is:

  CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)--,
   |\                                           |
  (ht)                                        JBOD
   |/                                           |
  CPU --(ht)-- IOB --(pcie)-- LSI 9200 --(sas)--'

Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On 06/06/2012 09:43 PM, Jim Mauro wrote: I can't help but be curious about something, which perhaps you verified but did not post. What the data here shows is; - CPU 31 is buried in the kernel (100% sys). - CPU 31 is handling a moderate-to-high rate of xcalls. What the data does not prove empirically is that the 100% sys time of CPU 31 is in xcall handling. What's the hot stack when this occurs and you run this; dtrace -n 'profile-997hz /cpu == 31/ { @[stack()] = count(); }' Thanks for pointing this out. I ran the probe you specified and attached are the results (I had to chase the xcalls around a bit, because they were jumping around cores as I was trying to insert the probes). As I suspected, the most numerous stack trace is the one which causes cross calls because of the segkmem_zio_free+0x27 code path. While this was going on, I was getting between 80k and 300k xcalls on the core in question. The next most common stack was the one ending in mach_cpu_idle, so I'm not sure why the CPU reported 100% busy (perhaps the xcalls were very expensive on CPU time compared with the 1273 idles). Cheers, -- Saso xc_call.txt.bz2 Description: application/bzip ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] MPxIO n00b question
On 05/25/2012 08:40 PM, Richard Elling wrote: See the solution at https://www.illumos.org/issues/644 -- richard And predictably, I'm back with another n00b question regarding this array. I've put a pair of LSI-9200-8e controllers in the server and attached cables from the enclosure to each of the HBAs. As a result (why?) I'm getting some really strange behavior:
* piss poor performance (around 5MB/s per disk tops)
* fmd(1M) running one core at near 100% saturation each time something writes or reads from the pool
* using fmstat I noticed that it's the eft module receiving hundreds of fault reports every second
* fmd is flooded by multipath failover ereports like:

  ...
  May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran
  May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran
  May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered
  May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran
  May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr
  May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran
  May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran
  May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr
  May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered
  ...

I suspect that multipath is somehow not very happy with my Toshiba disks, but I have no idea what to do to make it work at least somehow acceptably. I tried messing with scsi_vhci.conf to try and set load-balance=none, changing the scsi-vhci-failover-override for the Toshiba disks to f_asym_lsi, flashing the latest as well as old firmware in the cards, reseating them in other PCI-e slots, removing one cable and even removing one whole HBA, unloading the eft fmd module etc., but nothing has helped so far and I'm sort of out of ideas. Anybody else got an idea on what I might try? Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] MPxIO n00b question
On 05/30/2012 10:53 PM, Richard Elling wrote: On May 30, 2012, at 1:07 PM, Sašo Kiselkov wrote: On 05/25/2012 08:40 PM, Richard Elling wrote: See the soluion at https://www.illumos.org/issues/644 -- richard And predictably, I'm back with another n00b question regarding this array. I've put a pair of LSI-9200-8e controllers in the server and attached the cables to the enclosure to each of the HBAs. As a result (why?) I'm getting some really strange behavior: * piss poor performance (around 5MB/s per disk tops) * fmd(1M) running one core at near 100% saturation each time something writes or reads from the pool * using fmstat I noticed that its the eft module receiving hundreds of fault reports every second * fmd is flooded by multipath failover ereports like: ... May 29 21:11:44.9408 ereport.io.scsi.cmd.disk.tran May 29 21:11:44.9423 ereport.io.scsi.cmd.disk.tran May 29 21:11:44.8474 ereport.io.scsi.cmd.disk.recovered May 29 21:11:44.9455 ereport.io.scsi.cmd.disk.tran May 29 21:11:44.9457 ereport.io.scsi.cmd.disk.dev.rqs.derr May 29 21:11:44.9462 ereport.io.scsi.cmd.disk.tran May 29 21:11:44.9527 ereport.io.scsi.cmd.disk.tran May 29 21:11:44.9535 ereport.io.scsi.cmd.disk.dev.rqs.derr May 29 21:11:44.6362 ereport.io.scsi.cmd.disk.recovered ... I suspect that multipath is something not exactly very happy with my Toshiba disks, but I have no idea what to do to make it work at least somehow acceptably. I tried messing with scsi_vhci.conf to try and set load-balance=none, change the scsi-vhci-failover-override for the Toshiba disks to f_asym_lsi, flashing the latest as well as old firmware in the cards, reseating them to other PCI-e slots, removing one cable and even removing one whole HBA, unloading the eft fmd module etc, but nothing helped so far and I'm sort of out of ideas. Anybody else got an idea on what I might try? Those ereports are consistent with faulty cabling. You can trace all of the cables and errors using tools like lsiutil, sg_logs, kstats, etc. Unfortunately, it is not really possible to get into this level of detail over email, and it can consume many hours. -- richard That's actually a pretty good piece of information for me! I will try changing my cabling to see if I can get the errors to go away. Thanks again for the suggestions! Cheers -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] MPxIO n00b question
On 05/30/2012 10:53 PM, Richard Elling wrote: Those ereports are consistent with faulty cabling. You can trace all of the cables and errors using tools like lsiutil, sg_logs, kstats, etc. Unfortunately, it is not really possible to get into this level of detail over email, and it can consume many hours. -- richard And it turns out you were right. Looking at errors using iostat -E while manipulating the path taken by the data using mpathadm clearly shows that one of the paths is faulty. Thanks again for pointing me in the right direction! Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
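(A minimal sketch of the per-path checking described above; the device and port names are placeholders, not taken from the thread.)

    # per-device error counters (soft/hard/transport errors)
    iostat -En

    # list multipathed LUs and inspect path state/count for one of them
    mpathadm list lu
    mpathadm show lu /dev/rdsk/c0t50000393E8CAF2A4d0s2

    # steer I/O away from one path and see whether the errors follow it
    mpathadm disable path -i <initiator-port> -t <target-port> -l <logical-unit>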
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On 05/28/2012 10:48 AM, Ian Collins wrote: To follow up, the H310 appears to be useless in non-raid mode. The drives do show up in Solaris 11 format, but they show up as unknown, unformatted drives. One oddity is the box has two SATA SSDs which also show up in the card's BIOS, but present OK to Solaris. I'd like to re-FLASH the cards, but I don't think Dell would be too happy with me doing that on an evaluation system...
If the drives show up at all, chances are you only need to work around the power-up issue in Dell HDD firmware. Here's what I had to do to get the drives going in my R515, in /kernel/drv/sd.conf:
    sd-config-list = "SEAGATE ST3300657SS", "power-condition:false",
                     "SEAGATE ST2000NM0001", "power-condition:false";
(that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives; depending on your drive model the strings might differ) -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
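(One way to find the exact strings to put into sd.conf is to read the drive's INQUIRY data, e.g. from iostat -En; the output below is abbreviated and the revision string is made up. As far as I know, sd matches on the vendor field padded to 8 characters followed immediately by the product string, which is why "SEAGATE ST3300657SS" contains exactly one space.)

    # iostat -En
    c1t0d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
            Vendor: SEAGATE  Product: ST3300657SS  Revision: 0006  ...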
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On 05/28/2012 11:48 AM, Ian Collins wrote: On 05/28/12 08:55 PM, Sašo Kiselkov wrote: On 05/28/2012 10:48 AM, Ian Collins wrote: To follow up, the H310 appears to be useless in non-raid mode. The drives do show up in Solaris 11 format, but they show up as unknown, unformatted drives. One oddity is the box has two SATA SSDs which also show up the card's BIOS, but present OK to Solaris. I'd like to re-FLASH the cards, but I don't think Dell would be too happy with me doing that on an evaluation system... If the drives show up at all, chances are you only need to work around the power-up issue in Dell HDD firmware. Here's what I had to do to get the drives going in my R515: /kernel/drv/sd.conf sd-config-list = SEAGATE ST3300657SS, power-condition:false, SEAGATE ST2000NM0001, power-condition:false; (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on your drive model the strings might differ) How would that work when the drive type is unknown (to format)? I assumed if sd knows the type, so will format. Simply take out the drive and have a look at the label. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On 05/28/2012 12:59 PM, Ian Collins wrote: On 05/28/12 10:53 PM, Sašo Kiselkov wrote: On 05/28/2012 11:48 AM, Ian Collins wrote: On 05/28/12 08:55 PM, Sašo Kiselkov wrote: On 05/28/2012 10:48 AM, Ian Collins wrote: To follow up, the H310 appears to be useless in non-raid mode. The drives do show up in Solaris 11 format, but they show up as unknown, unformatted drives. One oddity is the box has two SATA SSDs which also show up the card's BIOS, but present OK to Solaris. I'd like to re-FLASH the cards, but I don't think Dell would be too happy with me doing that on an evaluation system... If the drives show up at all, chances are you only need to work around the power-up issue in Dell HDD firmware. Here's what I had to do to get the drives going in my R515: /kernel/drv/sd.conf sd-config-list = SEAGATE ST3300657SS, power-condition:false, SEAGATE ST2000NM0001, power-condition:false; (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on your drive model the strings might differ) How would that work when the drive type is unknown (to format)? I assumed if sd knows the type, so will format. Simply take out the drive and have a look at the label. Tricky when the machine is on a different continent! Joking aside, *I* know what the drive is, the OS as far as I can tell doesn't. Can you have a look at your /var/adm/messages or dmesg to check whether the OS is complaining about failed to power up on the relevant drives? If yes, then the above fix should work for you, all you need to do is determine the exact manufacturer and model to enter into sd.conf and reload the driver via update_drv -vf sd. Cheers -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
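(Putting the steps above together as a rough sequence; the grep pattern and the drive strings are illustrative.)

    # confirm the drives are hitting the power-up problem
    grep -i "failed to power up" /var/adm/messages

    # add a matching entry (vendor/product from the label or from iostat -En)
    # to /kernel/drv/sd.conf, e.g.:
    #   sd-config-list = "SEAGATE ST3300657SS", "power-condition:false";

    # reload the sd driver configuration and re-check the drives
    update_drv -vf sd
    format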
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On 05/28/2012 01:12 PM, Ian Collins wrote: On 05/28/12 11:01 PM, Sašo Kiselkov wrote: On 05/28/2012 12:59 PM, Ian Collins wrote: On 05/28/12 10:53 PM, Sašo Kiselkov wrote: On 05/28/2012 11:48 AM, Ian Collins wrote: On 05/28/12 08:55 PM, Sašo Kiselkov wrote: On 05/28/2012 10:48 AM, Ian Collins wrote: To follow up, the H310 appears to be useless in non-raid mode. The drives do show up in Solaris 11 format, but they show up as unknown, unformatted drives. One oddity is the box has two SATA SSDs which also show up the card's BIOS, but present OK to Solaris. I'd like to re-FLASH the cards, but I don't think Dell would be too happy with me doing that on an evaluation system... If the drives show up at all, chances are you only need to work around the power-up issue in Dell HDD firmware. Here's what I had to do to get the drives going in my R515: /kernel/drv/sd.conf sd-config-list = SEAGATE ST3300657SS, power-condition:false, SEAGATE ST2000NM0001, power-condition:false; (that's for Seagate 300GB 15k SAS and 2TB 7k2 SAS drives, depending on your drive model the strings might differ) How would that work when the drive type is unknown (to format)? I assumed if sd knows the type, so will format. Simply take out the drive and have a look at the label. Tricky when the machine is on a different continent! Joking aside, *I* know what the drive is, the OS as far as I can tell doesn't. Can you have a look at your /var/adm/messages or dmesg to check whether the OS is complaining about failed to power up on the relevant drives? If yes, then the above fix should work for you, all you need to do is determine the exact manufacturer and model to enter into sd.conf and reload the driver via update_drv -vf sd. Yes I do see that warning for the non-raid drives. The problem is I'm booting from a remote ISO image, so I can't alter /kernel/drv/sd.conf. I'll play more tomorrow, typing on a remote console inside an RDP session running in a VNC session on a virtual machine is interesting :) I'm not sure about the Solaris 11 installer, but OpenIndiana's installer runs from a ramdisk, so theoretically that should be doable. Other than that you could do it by copying the contents of /kernel from the ISO into a ramdrive and mounting that in place of /kernel and then issue the reload command. In any case, you seem to be having exactly the same issue as I did, so all you need to do is the above magic. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
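(A sketch of the ramdisk workaround described above, assuming the live environment keeps /tmp on tmpfs; paths are illustrative and the procedure is untested.)

    # shadow /kernel with a writable copy
    mkdir /tmp/kernel
    cp -rp /kernel/* /tmp/kernel/
    mount -F lofs /tmp/kernel /kernel

    # edit the now-writable copy and reload the driver
    vi /kernel/drv/sd.conf
    update_drv -vf sd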
Re: [zfs-discuss] Has anyone used a Dell with a PERC H310?
On 05/07/2012 05:42 AM, Greg Mason wrote: I am currently trying to get two of these things running Illumian. I don't have any particular performance requirements, so I'm thinking of using some sort of supported hypervisor, (either RHEL and KVM or VMware ESXi) to get around the driver support issues, and passing the disks through to an Illumian guest. The H310 does indeed support pass-through (the non-raid mode), but one thing to keep in mind is that I was only able to configure a single boot disk. I configured the rear two drives into a hardware raid 1 and set the virtual disk as the boot disk so that I can still boot the system if an OS disk fails. Once Illumos is better supported on the R720 and the PERC H310, I plan to get rid of the hypervisor silliness and run Illumos on bare metal. How about reflashing LSI firmware to the card? I read on Dell's spec sheets that the card runs an LSISAS2008 chip, so chances are that standard LSI firmware will work on it. I can send you all the required bits to do the reflash, if you like. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] MPxIO n00b question
I'm currently trying to get a SuperMicro JBOD with dual SAS expander chips running in MPxIO, but I'm a total amateur to this and would like to ask about how to detect whether MPxIO is working (or not). My SAS topology is: *) One LSI SAS2008-equipped HBA (running the latest IT firmware from LSI) with two external ports. *) Two SAS cables running from the HBA to the SuperMicro JBOD, where they enter the JBOD's rear backplane (which is equipped with two LSI SAS expander chips). *) From the rear backplane, via two internal SAS cables to the front backplane (also with two SAS expanders on it) *) The JBOD is populated with 45 2TB Toshiba SAS 7200rpm drives The machine also has a PERC H700 for the boot media, configured into a hardware RAID-1 (on which rpool resides). Here is the relevant part from cfgadm -al for the MPxIO bits: c5 scsi-sas connectedconfigured unknown c5::dsk/c5t5393D8CB4452d0 disk connectedconfigured unknown c5::dsk/c5t5393E8C90CF2d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2A6d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2AAd0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2BEd0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2C6d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2E2d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF2F2d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF5C6d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF28Ad0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF32Ed0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF35Ad0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF35Ed0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF36Ad0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF36Ed0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF52Ed0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF53Ad0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF53Ed0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF312d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF316d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF506d0 disk connectedconfigured unknown c5::dsk/c5t5393E8CAF546d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C84F5Ed0 disk connectedconfigured unknown c5::dsk/c5t5393F8C84FBAd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C851EEd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C852A6d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C852C2d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C852CAd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C852EAd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C854BAd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C854E2d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C855AAd0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8509Ad0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8520Ad0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8528Ad0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8530Ed0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8531Ed0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8557Ed0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8558Ed0 disk connectedconfigured unknown c5::dsk/c5t5393F8C8560Ad0 disk connectedconfigured unknown c5::dsk/c5t5393F8C85106d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C85222d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C85246d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C85366d0 disk connectedconfigured unknown c5::dsk/c5t5393F8C85636d0 disk connectedconfigured unknown c5::es/ses0ESI 
connectedconfigured unknown c5::es/ses1ESI connectedconfigured unknown c5::smp/expd0 smp connectedconfigured unknown c5::smp/expd1 smp connectedconfigured unknown c6 scsi-sas connectedconfigured unknown c6::dsk/c6t5393D8CB4453d0 disk connectedconfigured unknown c6::dsk/c6t5393E8C90CF3d0 disk connectedconfigured unknown c6::dsk/c6t5393E8CAF2A7d0 disk connectedconfigured unknown c6::dsk/c6t5393E8CAF2ABd0 disk
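(For what it's worth, a couple of ways to check whether MPxIO is actually active; the device name passed to mpathadm show lu is only a placeholder.)

    # is MPxIO enabled for the mpt_sas ports, and what are the name mappings?
    stmsboot -D mpt_sas -L

    # list multipathed logical units and show path count/state for one of them
    mpathadm list lu
    mpathadm show lu /dev/rdsk/c0t50000393E8CAF2A4d0s2

    # with MPxIO active, each disk should appear once under a single scsi_vhci
    # controller instead of once per path under c5 and c6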
Re: [zfs-discuss] MPxIO n00b question
On 05/25/2012 07:35 PM, Jim Klimov wrote: Sorry I can't comment on MPxIO, except that I thought zfs could by itself discern two paths to the same drive, if only to protect against double-importing the disk into pool. Unfortunately, it isn't the same thing. MPxIO provides redundant signaling to the drives, independent of the storage/RAID layer above it, so it does have its place (besides simply increasing throughput). I am not sure it is a good idea to use such low protection (raidz1) with large drives. At least, I was led to believe that after 2Tb in size raidz2 is preferable, and raidz3 is optimal due to long scrub/resilver times leading to large timeframes that a pool with an error is exposed to possible fatal errors (due to double-failures with single-protection). I'd use lower protection if it were available :) The data on that array is not very important, the primary design parameter is low cost per MB. We're in a very demanding IO environment, we need large quantities of high-throughput, high-IOPS storage, but we don't need stellar reliability. If the pool gets corrupted due to unfortunate double-drive failure, well, that's tough, but not unbearable (the pool stores customer channel recordings for nPVR, so nothing critical really). -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] MPxIO n00b question
On 05/25/2012 08:40 PM, Richard Elling wrote: See the solution at https://www.illumos.org/issues/644 -- richard Good Lord, that was it! It never occurred to me that the drives had a say in this. Thanks a billion! Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Dell PERC H200: drive failed to power up
Hi, I'm getting weird errors while trying to install OpenIndiana 151a on a Dell R715 with a PERC H200 (based on an LSI SAS 2008). Any time the OS tries to access the drives (for whatever reason), I get this dumped into syslog:
genunix: WARNING: Device /pci@0,0/pci1002,5a18@4/pci10b5,8424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@40/disk@w5c0f01004ebe,0 failed to power up
genunix: WARNING: Device /pci@0,0/pci1002,5a18@4/pci10b5,8424@0/pci10b5,8624@0/pci1028,1f1e@0/iport@80/disk@w5c0f01064e9e,0 failed to power up
(these are two WD 300GB 10k SAS drives) When this log message shows up, I can see each drive light up its LED briefly and then it turns off, so apparently the OS tried to initialize the drives, but somehow failed and gave up. Consequently, when I try to access them in format(1), they show up as an unknown type, and installing OpenIndiana on them fails while the installer is trying to run fdisk. Has anybody got any idea what I can do to the controller/drives/whatever to fix the "failed to power up" problem? One would think that an LSI SAS 2008 chip would be problem-free under Solaris (the server even lists Oracle Solaris as an officially supported OS), but alas, I have yet to succeed. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dell PERC H200: drive failed to power up
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote: Hi, are those DELL branded WD disks? DELL tends to manipulate the firmware of the drives so that power handling with Solaris fails. If this is the case here: the easiest way to make it work is to modify /kernel/drv/sd.conf and add an entry for your specific drive similar to this:
    sd-config-list = "WD WD2000FYYG", "power-condition:false",
                     "SEAGATE ST2000NM0001", "power-condition:false",
                     "SEAGATE ST32000644NS", "power-condition:false",
                     "SEAGATE ST91000640SS", "power-condition:false";
Naturally you would have to find out the correct drive names. My latest version for an R710 with an MD1200 attached is:
    sd-config-list = "SEAGATE ST2000NM0001", "power-condition:false",
                     "SEAGATE ST1000NM0001", "power-condition:false",
                     "SEAGATE ST91000640SS", "power-condition:false";
Are you using the H200 with the base firmware or did you flash it to LSI IT? I am not sure that Solaris handles the H200 natively at all, and if it does, it will not have direct drive access since the H200 will only show virtual drives to Solaris/OI, will it not?
They are Dell branded WD disks and I haven't done anything to the HBA's firmware, so that's stock Dell as well. The drives, specifically, are WD3001BKHG models. The firmware actually does expose the disks unless they're part of a RAID group, so that should actually work. I'll try the power-condition workaround you mentioned. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dell PERC H200: drive failed to power up
On 05/16/2012 09:45 AM, Koopmann, Jan-Peter wrote: Hi, are those DELL branded WD disks? DELL tends to manipulate the firmware of the drives so that power handling with Solaris fails. If this is the case here: Easiest way to make it work is to modify /kernel/drv/sd.conf and add an entry for your specific drive similar to this sd-config-list= WD WD2000FYYG,power-condition:false, SEAGATE ST2000NM0001,power-condition:false, SEAGATE ST32000644NS,power-condition:false, SEAGATE ST91000640SS,power-condition:false; Naturally you would have to find out the correct drive names. My latest version for a R710 with a MD1200 attached is: sd-config-list=SEAGATE ST2000NM0001,power-condition:false, SEAGATE ST1000NM0001,power-condition:false, SEAGATE ST91000640SS,power-condition:false; Are you using the H200 with the base firmware or did you flash it to LSI IT? I am not sure that Solaris handles the H200 natively at all and if then it will not have direct drive access since the H200 will only show virtual drives to Solaris/OI will it not? Kind regards, JP PS: These are not my findings. Cudos to Sergei (tehc...@gmail.com) and Niklas Tungström. One thing came up while trying this - I'm on a text install image system, so my / is a ramdisk. Any ideas how I can change the sd.conf on the USB disk or reload the driver configuration on the fly? I tried looking for the file on the USB drive, but it isn't in the rootfs (perhaps it's tucked away in some compressed filesystem image). Thanks! -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dell PERC H200: drive failed to power up
On 05/16/2012 10:17 AM, Koopmann, Jan-Peter wrote: One thing came up while trying this - I'm on a text install image system, so my / is a ramdisk. Any ideas how I can change the sd.conf on the USB disk or reload the driver configuration on the fly? I tried looking for the file on the USB drive, but it isn't in the rootfs (perhaps it's tucked away in some compressed filesystem image). Thanks! I am by no means a Solaris or OI guru and live from good advice of other people and Mr. Google. So sorry, I have no clue...
I got lucky at Googling after all and found the relevant command:
    # update_drv -vf sd
The PERC H200 card had nothing to do with it, it was all in the crappy firmware in the HDDs. Simply adding
    sd-config-list = "WD WD3001BKHG", "power-condition:false";
to my /kernel/drv/sd.conf (as you suggested) and reloading the driver using update_drv solved it and I could then proceed with the installation. The installer was even smart enough to install the customized sd.conf into the new system, so no further tuning was necessary. Thanks for the pointers, you saved my bacon. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
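(A quick way to double-check that the workaround took effect, sketched from the steps above.)

    # the warnings should no longer appear after the driver reload
    dmesg | grep -i "failed to power up"

    # and the drives should now report a proper type/geometry
    format < /dev/null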
Re: [zfs-discuss] Windows 8 ReFS (OT)
On 01/17/2012 01:06 AM, David Magda wrote: Kind of off topic, but I figured of some interest to the list. There will be a new file system in Windows 8 with some features that we all know and love in ZFS: As mentioned previously, one of our design goals was to detect and correct corruption. This not only ensures data integrity, but also improves system availability and online operation. Thus, all ReFS metadata is check-summed at the level of a B+ tree page, and the checksum is stored independently from the page itself. [...] Once ReFS detects such a failure, it interfaces with Storage Spaces to read all available copies of data and chooses the correct one based on checksum validation. It then tells Storage Spaces to fix the bad copies based on the good copies. All of this happens transparently from the point of view of the application. Looks like what the Btrfs people were trying to do. -- S ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 07/01/2011 12:01 AM, Sašo Kiselkov wrote: On 06/30/2011 11:56 PM, Sašo Kiselkov wrote: Hm, it appears I'll have to do some reboots and more extensive testing. I tried tuning various settings and then returned everything back to the defaults. Yet, now I can ramp the number of concurrent output streams to ~170 instead of the original 95 (even then the pool still has capacity left, I'm actually running out of CPU power). The txg commit occurs roughly every 15 (or so) seconds, which is what I wanted. Strange that this occurs even after I returned everything to the defaults... I'll try doing some more testing on this once I move the production deployment to a different system and I'll have more time to experiment with this machine. Anyways, thanks for the suggestions, it helped a lot. Regards, -- Saso Just a follow-up correction: one parameter was indeed changed: zfs_write_limit_inflated. In the source it's set to zero, I've set it to 0x2. So it seems I was wrong after all and it didn't help. So the question remains: is there a way to force ZFS *NOT* to commit a txg before a certain minimum amount of data has accumulated in it, or before the txg timeout is reached? All the best, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
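(For anyone following the thread, the tunables under discussion can be poked at runtime with mdb or set persistently in /etc/system. The values below only illustrate the mechanism - they are not a recommendation, and whether any of them actually enforce a minimum txg size is exactly the open question above.)

    # in /etc/system (takes effect on reboot):
    #   set zfs:zfs_txg_timeout = 15
    #   set zfs:zfs_write_limit_override = 0x9600000    (150 MB)
    #
    # or at runtime via mdb:
    echo zfs_txg_timeout/W0t15 | mdb -kw
    echo zfs_write_limit_override/W0t157286400 | mdb -kw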
Re: [zfs-discuss] HP JBOD D2700 - ok?
On 11/30/2011 02:40 PM, Edmund White wrote: Absolutely. I'm using a fully-populated D2700 with an HP ProLiant DL380 G7 server running NexentaStor. On the HBA side, I used the LSI 9211-8i 6G controllers for the server's internal disks (boot, a handful of large disks, Pliant SSDs for L2Arc). There is also a DDRDrive for ZIL. To connect to the D2700 enclosure, I used 2 x LSI 9205 6G HBAs; one 4-lane SAS cable per storage controller on the D2700. These were setup with MPxIO (dual controllers, dual paths, dual-ported disks) and required a slight bit of tuning of /kernel/drv/scsi_vhci.conf, but the performance is great now. The enclosure is supported and I've been able to setup drive slot maps and control disk LED's, etc. Coincidentally, I'm also thinking about getting a few D2600 enclosures, but I've been considering attaching them via a pair of HP SC08Ge 6G SAS HBAs. Has anybody had any experience with these HBAs? According to a few searches on the Internet, it should be a rebranded LSI9200-8e. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 06/30/2011 01:10 PM, Jim Klimov wrote: 2011-06-30 11:47, Sašo Kiselkov пишет: On 06/30/2011 02:49 AM, Jim Klimov wrote: 2011-06-30 2:21, Sašo Kiselkov пишет: On 06/29/2011 02:33 PM, Sašo Kiselkov wrote: Also there is a buffer-size limit, like this (384Mb): set zfs:zfs_write_limit_override = 0x1800 or on command-line like this: # echo zfs_write_limit_override/W0t402653184 | mdb -kw Currently my value for this is 0. How should I set it? I'm writing ~15MB/s and would like txg flushes to occur at most once every 10 seconds. Should I set it to 150MB then? We had similar spikes with big writes to a Thumper with SXCE in the pre-90's builds, when the system would stall for seconds while flushing a 30-second TXG full of data. Adding a reasonable megabyte limit solved the unresponsiveness problem for us, by making these flush-writes rather small and quick. I need to do the opposite - I don't need to shorten the interval of writes, I need to increase it. Can I do that using zfs_write_limit_override? Just as a folloup, I've had a look at the tunables in dsl_pool.c and found that I could potentially influence the write pressure calculation by tuning zfs_txg_synctime_ms - do you think increasing this value from its default (1000ms) help me lower the write scheduling frequency? (I don't mind if a txg write takes even twice as long, my application buffers are on average 6 seconds long.) Regards, -- Saso It might help. In my limited testing on oi_148a, it seems that zfs_txg_synctime_ms and zfs_txg_timeout are linked somehow (i.e. changing one value changed the other accordingly). So in effect they may be two names for the same tunable (one in single units of secs, another in thousands of msecs). Well, to my understanding, zfs_txg_timeout is the timer limit on flushing pending txgs to disk - if the timer fires the current txg is written to disk regardless of its size. Otherwise the txg scheduling algorithm should take into account I/O pressure on the pool, estimate the remaining write bandwidth and fire when it estimates that a txg commit would overflow zfs_txg_synctime[_ms]. I tried increasing this value to 2000 or 3000, but without an effect - prehaps I need to set it at pool mount time or in /etc/system. Could somebody with more knowledge of these internals please chime in? Somewhere in our discussion the Reply-to-all was lost. Back to the list :) Saso: Did you try setting both the timeout limit and the megabyte limit values, and did you see system IO patterns correlate with these values? My understanding was lke yours above, so if things are different in reality - I'm interested to know too. PS: I don't think you wrote: which OS version do you use? Thanks for the suggestions, I'll try them out. I'm running oi_148. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 06/30/2011 01:33 PM, Jim Klimov wrote: 2011-06-30 15:22, Sašo Kiselkov пишет: I tried increasing this value to 2000 or 3000, but without an effect - prehaps I need to set it at pool mount time or in /etc/system. Could somebody with more knowledge of these internals please chime in? And about this part - it was my understanding and experience (from SXCE) that these values can be set at run-time and are used as soon as set (or maybe in a few TXGs - but visibly in real-time). Also I've seen instant result from setting the TXG sync times on oi_148a with little loads (in my thread about trying to account for some 2Mb writes to my root pool) - this could be 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on TXG timeout currently set value. Hm, it appears I'll have to do some reboots and more extensive testing. I tried tuning various settings and then returned everything back to the defaults. Yet, now I can ramp the number of concurrent output streams to ~170 instead of the original 95 (even then the pool still has capacity left, I'm actually running out of CPU power). The txg commit occurs at roughly every 15 (or so) seconds, which is what I wanted. Strange that this occurs even after I returned everything to the defaults... I'll try doing some more testing on this once I move the production deployment to a different system and I'll have more time to experiment with this machine. Anyways, thanks for the suggestions, it helped a lot. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 06/30/2011 11:56 PM, Sašo Kiselkov wrote: On 06/30/2011 01:33 PM, Jim Klimov wrote: 2011-06-30 15:22, Sašo Kiselkov пишет: I tried increasing this value to 2000 or 3000, but without an effect - prehaps I need to set it at pool mount time or in /etc/system. Could somebody with more knowledge of these internals please chime in? And about this part - it was my understanding and experience (from SXCE) that these values can be set at run-time and are used as soon as set (or maybe in a few TXGs - but visibly in real-time). Also I've seen instant result from setting the TXG sync times on oi_148a with little loads (in my thread about trying to account for some 2Mb writes to my root pool) - this could be 2Mb/s or 0.2Mb/s (all in 2Mb bursts though) depending on TXG timeout currently set value. Hm, it appears I'll have to do some reboots and more extensive testing. I tried tuning various settings and then returned everything back to the defaults. Yet, now I can ramp the number of concurrent output streams to ~170 instead of the original 95 (even then the pool still has capacity left, I'm actually running out of CPU power). The txg commit occurs at roughly every 15 (or so) seconds, which is what I wanted. Strange that this occurs even after I returned everything to the defaults... I'll try doing some more testing on this once I move the production deployment to a different system and I'll have more time to experiment with this machine. Anyways, thanks for the suggestions, it helped a lot. Regards, -- Saso Just a follow correction: one parameter was indeed changed: zfs_write_limit_inflated. In the source it's set to zero, I've set it to 0x2. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 06/29/2011 02:33 PM, Sašo Kiselkov wrote: Also there is a buffer-size limit, like this (384Mb): set zfs:zfs_write_limit_override = 0x1800 or on command-line like this: # echo zfs_write_limit_override/W0t402653184 | mdb -kw Currently my value for this is 0. How should I set it? I'm writing ~15MB/s and would like txg flushes to occur at most once every 10 seconds. Should I set it to 150MB then? We had similar spikes with big writes to a Thumper with SXCE in the pre-90's builds, when the system would stall for seconds while flushing a 30-second TXG full of data. Adding a reasonable megabyte limit solved the unresponsiveness problem for us, by making these flush-writes rather small and quick. I need to do the opposite - I don't need to shorten the interval of writes, I need to increase it. Can I do that using zfs_write_limit_override? Just as a follow-up, I've had a look at the tunables in dsl_pool.c and found that I could potentially influence the write pressure calculation by tuning zfs_txg_synctime_ms - do you think increasing this value from its default (1000ms) would help me lower the write scheduling frequency? (I don't mind if a txg write takes even twice as long, my application buffers are on average 6 seconds long.) Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fixing txg commit frequency
On 06/27/2011 11:59 AM, Jim Klimov wrote: I'd like to ask about whether there is a method to enforce a certain txg commit frequency on ZFS. Well, there is a timer frequency based on TXG age (i.e 5 sec by default now), in /etc/system like this: set zfs:zfs_txg_synctime = 5 When trying to read the value through mdb I get: # echo zfs_txg_synctime::print | mdb -k mdb: failed to dereference symbol: unknown symbol name Is this some new addition in S11E? Also there is a buffer-size limit, like this (384Mb): set zfs:zfs_write_limit_override = 0x1800 or on command-line like this: # echo zfs_write_limit_override/W0t402653184 | mdb -kw Currently my value for this is 0. How should I set it? I'm writing ~15MB/s and would like txg flushes to occur at most once every 10 seconds. Should I set it to 150MB then? We had similar spikes with big writes to a Thumper with SXCE in the pre-90's builds, when the system would stall for seconds while flushing a 30-second TXG full of data. Adding a reasonable megabyte limit solved the unresponsiveness problem for us, by making these flush-writes rather small and quick. I need to do the opposite - I don't need to shorten the interval of writes, I need to increase it. Can I do that using zfs_write_limit_override? Thanks. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
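(While experimenting with these limits, one way to see how often txgs actually commit is to trace spa_sync. This is only a sketch - it assumes the fbt probe and spa_t layout of builds from this era, and "tank" is a placeholder pool name.)

    # print a timestamp and txg number each time the pool starts a sync
    dtrace -n 'fbt::spa_sync:entry
        /stringof(args[0]->spa_name) == "tank"/
        { printf("%Y txg %d", walltimestamp, args[1]); }'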
Re: [zfs-discuss] Fixing txg commit frequency
On 06/26/2011 06:17 PM, Richard Elling wrote: On Jun 24, 2011, at 5:29 AM, Sašo Kiselkov wrote: Hi All, I'd like to ask about whether there is a method to enforce a certain txg commit frequency on ZFS. I'm doing a large amount of video streaming from a storage pool while also slowly continuously writing a constant volume of data to it (using a normal file descriptor, *not* in O_SYNC). When reading volume goes over a certain threshold (and average pool load over ~50%), ZFS thinks it's running out of steam on the storage pool and starts committing transactions more often which results in even greater load on the pool. This leads to a sudden spike in I/O utilization on the pool in roughly the following method: # streaming clients pool load [%] 15 8% 20 11% 40 22% 60 33% 80 44% --- around here txg timeouts start to shorten --- 85 60% 90 70% 95 85% What is a pool load? We expect 100% utilization during the txg commit, anything else is a waste. I suspect that you actually want more, smaller commits to spread the load more evenly. This is easy to change, but unless you can tell us what OS you are running, including version, we don't have a foundation to build upon. -- richard Pool load is a 60 seconds average of the aggregated util percentages as reported by iostat -D for the disks which comprise the pool (So I run iostat -Dn {pool-disks} 60 and compute the load for each row printed as an average of the util columns). Interestingly enough, when watching 1-second updates in iostat I never see util hit 100% during a txg commit, even if it takes two or more seconds to complete. This tells me that the disks still have enough performance headroom so that zfs doesn't really need to shorten the interval at which commits occur. I'm running oi_148, and all pools are zfs version 28. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
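(The pool-load figure described above can be approximated with something along these lines. It is a rough sketch: iostat's column layout differs between options and releases, so the assumption that every third field is a util column may need adjusting.)

    # with 'iostat -Dn disk1 disk2 ... 60' each disk contributes three
    # columns (rps, wps, util); average every third column per sample
    iostat -Dn c5t0d0 c5t1d0 c5t2d0 60 | \
        awk '$1 ~ /^[0-9.]+$/ {
            s = 0; n = 0
            for (i = 3; i <= NF; i += 3) { s += $i; n++ }
            if (n) printf("avg util: %.1f%%\n", s / n)
        }'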
[zfs-discuss] Fixing txg commit frequency
Hi All, I'd like to ask about whether there is a method to enforce a certain txg commit frequency on ZFS. I'm doing a large amount of video streaming from a storage pool while also slowly continuously writing a constant volume of data to it (using a normal file descriptor, *not* in O_SYNC). When the reading volume goes over a certain threshold (and average pool load over ~50%), ZFS thinks it's running out of steam on the storage pool and starts committing transactions more often, which results in even greater load on the pool. This leads to a sudden spike in I/O utilization on the pool, roughly as follows:
    # streaming clients    pool load [%]
     15                     8%
     20                    11%
     40                    22%
     60                    33%
     80                    44%
    --- around here txg timeouts start to shorten ---
     85                    60%
     90                    70%
     95                    85%
My application does a fair bit of caching and prefetching, so I have zfetch disabled and primarycache set to only metadata. Also, reads happen (on a per-client basis) relatively infrequently, so I can easily take it if the pool stops reading for a few seconds and just writes data. The problem is, ZFS starts alternating between reads and writes really quickly, which in turn starves me of IOPS and results in a huge load spike. Judging by load numbers up to around 80 concurrent clients, I suspect I could go up to 150 concurrent clients on this pool, but because of this spike I top out at around 95-100 concurrent clients. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Monitoring disk seeks
On 05/24/2011 03:08 PM, a.sm...@ukgrid.net wrote: Hi, see the seeksize script on this URL: http://prefetch.net/articles/solaris.dtracetopten.html Not used it but looks neat! cheers Andy. I already did and it does the job just fine. Thank you for your kind suggestion. BR, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Monitoring disk seeks
On 05/19/2011 07:47 PM, Richard Elling wrote: On May 19, 2011, at 5:35 AM, Sašo Kiselkov wrote: Hi all, I'd like to ask whether there is a way to monitor disk seeks. I have an application where many concurrent readers (50) sequentially read a large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor read/write ops using iostat, but that doesn't tell me how contiguous the data is, i.e. when iostat reports 500 read ops, does that translate to 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks! In general, this is hard to see from the OS. In Solaris, the default I/O flowing through sd gets sorted based on LBA before being sent to the disk. If the disks gets more than 1 concurrent I/O request (10 is the default for Solaris-based ZFS) then the disk can resort or otherwise try to optimize the media accesses. As others have mentioned, iopattern is useful for looking a sequential patterns. I've made some adjustments for the version at http://www.richardelling.com/Home/scripts-and-programs-1/iopattern You can see low-level SCSI activity using scsi.d, but I usually uplevel that to using iosnoop -Dast which shows each I/O and its response time. Note that the I/Os can complete out-of-order on many devices. The only device I know that is so fast and elegant that it always completes in-order is the DDRdrive. For detailed analysis of iosnoop data, you will appreciate a real statistics package. I use JMP, but others have good luck with R. -- richard Thank you, the iopattern script seems to be quite close to what I wanted. The percentage split between random and sequential I/O is pretty much what I needed to know. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
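(For the archives, typical invocations of the tools mentioned above; the interval argument is illustrative.)

    # DTraceToolkit iopattern: %RAN vs %SEQ split, sampled every 10 seconds
    ./iopattern 10

    # per-I/O trace with device, block offset, size and response time
    ./iosnoop -Dast

    # seek-distance distribution per device (also from the DTraceToolkit)
    ./seeksize.d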
[zfs-discuss] Monitoring disk seeks
Hi all, I'd like to ask whether there is a way to monitor disk seeks. I have an application where many concurrent readers (50) sequentially read a large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor read/write ops using iostat, but that doesn't tell me how contiguous the data is, i.e. when iostat reports 500 read ops, does that translate to 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks! Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Monitoring disk seeks
On 05/19/2011 03:35 PM, Tomas Ögren wrote: On 19 May, 2011 - Sašo Kiselkov sent me these 0,6K bytes: Hi all, I'd like to ask whether there is a way to monitor disk seeks. I have an application where many concurrent readers (50) sequentially read a large dataset (10T) at a fairly low speed (8-10 Mbit/s). I can monitor read/write ops using iostat, but that doesn't tell me how contiguous the data is, i.e. when iostat reports 500 read ops, does that translate to 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks! Get DTraceToolkit and check out the various things under Disk and FS, might help. /Tomas Thank you all for the tips, I'll try to poke around using the DTraceToolkit. Regards, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 04/09/2011 01:41 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Julian King Actually I think our figures more or less agree. 12 disks = 7 mbits 48 disks = 4x7mbits I know that sounds like terrible performance to me. Any time I benchmark disks, a cheap generic SATA can easily sustain 500Mbit, and any decent drive can easily sustain 1Gbit. I think he mistyped and meant 7gbit/s. Of course it's lower when there's significant random seeking happening... But if you have a data model which is able to stream sequentially, the above is certainly true. Unfortunately, this is exactly my scenario, where I want to stream large volumes of data in many concurrent threads over large datasets which have no hope of fitting in RAM or L2ARC and with generally very little locality. -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 04/08/2011 05:20 PM, Mark Sandrock wrote: On Apr 8, 2011, at 7:50 AM, Evaldas Auryla evaldas.aur...@edqm.eu wrote: On 04/ 8/11 01:14 PM, Ian Collins wrote: You have built-in storage failover with an AR cluster; and they do NFS, CIFS, iSCSI, HTTP and WebDav out of the box. And you have fairly unlimited options for application servers, once they are decoupled from the storage servers. It doesn't seem like much of a drawback -- although it may be for some smaller sites. I see AR clusters going in in local high schools and small universities. Which is all fine and dandy if you have a green field, or are willing to re-architect your systems. We just wanted to add a couple more x4540s! Hi, same here, it's sad news that Oracle decided to stop the x4540 production line. Before, ZFS geeks had a choice - buy the 7000 series if you want quick out-of-the-box storage with a nice GUI, or build your own storage with the x4540 line, which by the way has a brilliant engineering design; that choice is gone now. Okay, so what is the great advantage of an X4540 versus an x86 server plus disk array(s)? Mark
Several:
1) Density: the X4540 has far greater density than a 1U server + Sun's J4200 or J4400 storage arrays. The X4540 did 12 disks / 1RU, whereas a 1U server + 2x J4400 only manages ~5.3 disks / 1RU.
2) Number of components involved: server + disk enclosure means you have more PSUs which can die on you, more cabling to accidentally disconnect and generally more hassle with installation.
3) Spare management: with the X4540 you only have to keep one kind of spare component: the server. With servers + enclosures, you might need to keep multiple.
I agree that besides 1), both 2) and 3) are relatively trivial problems to solve. Of course, server + enclosure builds do have their place, such as when you might need to scale, but even then you could just hook them up to an X4540 (or purchase a new one - I never quite understood why the storage-enclosure-only variant of the X4540 case was more expensive than an identical server). In short, I think the X4540 was an elegant and powerful system that definitely had its market, especially in my area of work (digital video processing - heavy on latency, throughput and IOPS - an area where the 7000-series with its over-the-network access would just be a totally useless brick). -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 04/08/2011 06:59 PM, Darren J Moffat wrote: On 08/04/2011 17:47, Sašo Kiselkov wrote: In short, I think the X4540 was an elegant and powerful system that definitely had its market, especially in my area of work (digital video processing - heavy on latency, throughput and IOPS - an area, where the 7000-series with its over-the-network access would just be a totally useless brick). As an engineer I'm curious have you actually tried a suitably sized S7000 or are you assuming it won't perform suitably for you ? No, I haven't tried a S7000, but I've tried other kinds of network storage and from a design perspective, for my applications, it doesn't even make a single bit of sense. I'm talking about high-volume real-time video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from a machine over UDP. Having to go over the network to fetch the data from a different machine is kind of like building a proxy which doesn't really do anything - if the data is available from a different machine over the network, then why the heck should I just put another machine in the processing path? For my applications, I need a machine with as few processing components between the disks and network as possible, to maximize throughput, maximize IOPS and minimize latency and jitter. Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Network video streaming [Was: Re: X4540 no next-gen product?]
On 04/08/2011 07:22 PM, J.P. King wrote: No, I haven't tried a S7000, but I've tried other kinds of network storage and from a design perspective, for my applications, it doesn't even make a single bit of sense. I'm talking about high-volume real-time video streaming, where you stream 500-1000 (x 8Mbit/s) live streams from a machine over UDP. Having to go over the network to fetch the data from a different machine is kind of like building a proxy which doesn't really do anything - if the data is available from a different machine over the network, then why the heck should I just put another machine in the processing path? For my applications, I need a machine with as few processing components between the disks and network as possible, to maximize throughput, maximize IOPS and minimize latency and jitter.
I can't speak for this particular situation or solution, but I think in principle you are wrong. Networks are fast. Hard drives are slow. Put a 10G connection between your storage and your front ends and you'll have the bandwidth[1]. Actually if you really were hitting 1000x8Mbits I'd put 2, but that is just a question of scale. In a different situation I have boxes which peak at around 7 Gb/s down a 10G link (in reality I don't need that much because it is all about the IOPS for me). That is with just twelve 15k disks. Your situation appears to be pretty ideal for storage hardware, so perfectly achievable from an appliance.
I envision this kind of scenario (using my fancy ASCII art skills :-)):

  ||========= streaming server ========||
  +-------+  SAS  +-----+  PCI-e  +-----+  Ethernet   +--------+
  | DISKS | ===== | RAM | ======= | NIC | =========== | client |
  +-------+       +-----+         +-----+             +--------+

And you are advocating for this kind of scenario:

  ||========= network storage =========||
  +-------+  SAS  +-----+  PCI-e  +-----+  Ethernet
  | DISKS | ===== | RAM | ======= | NIC | ========== ...
  +-------+       +-----+         +-----+

            ||========= streaming server ========||
            +-----+  PCI-e  +-----+  PCI-e  +-----+  Ethernet   +--------+
  ... ===== | NIC | ======= | RAM | ======= | NIC | =========== | client |
            +-----+         +-----+         +-----+             +--------+

I'm not constrained on CPU (so hooking up multiple streaming servers to one backend storage doesn't really make sense). So what exactly does this scenario add to my needs (besides needing extra hardware in both the storage and server: 10G NICs, cabling, modules, etc.)?
I'm not saying no, I'd love to improve the throughput, IOPS and latency characteristics of my systems.
I can't speak for the S7000 range. I ignored that entire product line because when I asked about it the markup was insane compared to just buying X4500/X4540s. The price for Oracle kit isn't remotely tenable, so the death of the X45xx range is a moot point for me anyway, since I couldn't afford it. [1] Just in case, you also shouldn't be adding any particularly significant latency either. Jitter, maybe, depending on the specifics of the streams involved. Saso Julian -- Julian King Computer Officer, University of Cambridge, Unix Support ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/07/2011 10:26 AM, Darren J Moffat wrote: On 06/01/2011 23:07, David Magda wrote: On Jan 6, 2011, at 15:57, Nicolas Williams wrote: Fletcher is faster than SHA-256, so I think that must be what you're asking about: can Fletcher+Verification be faster than Sha256+NoVerification? Or do you have some other goal? Would running on recent T-series servers, which have have on-die crypto units, help any in this regard? The on chip SHA-256 implementation is not yet used see: http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via Note that the fix I integrated only uses a software implementation of SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU hardware implementation of SHA256. The reason for this is to do with boot time availability of the Solaris Cryptographic Framework and the need to have ZFS as the root filesystem. Not yet changed it turns out to be quite complicated to fix due to very early boot issues. Would it be difficult to implement both methods and allow ZFS to switch to the hardware-accelerated crypto backend at runtime after it has been brought up and initialized? It seems like one heck of a feature (essentially removing most of the computational complexity of dedup). -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/07/2011 01:15 PM, Darren J Moffat wrote: On 07/01/2011 11:56, Sašo Kiselkov wrote: On 01/07/2011 10:26 AM, Darren J Moffat wrote: On 06/01/2011 23:07, David Magda wrote: On Jan 6, 2011, at 15:57, Nicolas Williams wrote: Fletcher is faster than SHA-256, so I think that must be what you're asking about: can Fletcher+Verification be faster than Sha256+NoVerification? Or do you have some other goal? Would running on recent T-series servers, which have have on-die crypto units, help any in this regard? The on chip SHA-256 implementation is not yet used see: http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via Note that the fix I integrated only uses a software implementation of SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU hardware implementation of SHA256. The reason for this is to do with boot time availability of the Solaris Cryptographic Framework and the need to have ZFS as the root filesystem. Not yet changed it turns out to be quite complicated to fix due to very early boot issues. Would it be difficult to implement both methods and allow ZFS to switch to the hardware-accelerated crypto backend at runtime after it has been brought up and initialized? It seems like one heck of a feature Wither it is difficult or not depends on your level of familiarity with ZFS, boot and the cryptographic framework ;-) For me no it wouldn't be difficult but it still isn't completely trivial. (essentially removing most of the computational complexity of dedup). Most of the data I've seen on the performance impact of dedup is not coming from the SHA256 computation it is mostly about the additional IO to deal with the DDT. Though lowering the overhead that SHA256 does add is always a good thing. Well, seeing as all mainline ZFS development is now happening behind closed doors, all I can really do is ask for features and hope Oracle implements them :-). Nevertheless, thanks for the clarification. BR, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss