Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
Hi Garrett,

Since my problem did turn out to be a debug kernel in my own compilations, I booted back into the Nexenta 3 RC2 CD and let a scrub run for about half an hour, to see if I just hadn't waited long enough the first time around. It never made it past 159 MB/s. I finally rebooted into my 145 non-debug kernel, and within a few seconds of reimporting the pool the scrub was up to ~400 MB/s, so it does indeed seem like the Nexenta CD kernel is either in debug mode, or something else is slowing it down.

Chad

On Wed, Jul 21, 2010 at 09:12:35AM -0700, Garrett D'Amore wrote:

On Wed, 2010-07-21 at 02:21 -0400, Richard Lowe wrote: I built in the normal fashion, with the CBE compilers (cc: Sun C 5.9 SunOS_i386 Patch 124868-10 2009/04/30), and 12u1 lint. I'm not subscribed to zfs-discuss, but have you established whether the problematic build is DEBUG? (The bits I uploaded were non-DEBUG.)

That would make a *huge* difference. DEBUG bits have zero optimization, and also include a great number of sanity tests that are absent from the non-DEBUG bits. If these are expensive checks on a hot code path, they can have a very nasty impact on performance. Now, that said, I *hope* the bits that Nexenta delivered were *not* DEBUG. But I've seen at least one bug that makes me think we might be delivering DEBUG binaries. I'll check into it. -- Garrett

-- Rich

Haudy Kazemi wrote: Could it somehow not be compiling 64-bit support? -- Brent Jones

I thought about that, but it says when it boots up that it is 64-bit, and I'm able to run 64-bit binaries. I wonder if it's compiling for the wrong processor optimization, though? Maybe if it is missing some of the newer SSEx instructions, the zpool checksum checking is slowed down significantly? I don't know how to check for this, though, and it seems strange that it would be slowed down this much. I'd expect even a non-SSE binary to be able to calculate a few hundred MB of checksums per second on a 2.5+ GHz processor.
Chad,

Would it be possible to do a closer comparison between Rich Lowe's fast 142 build and your slow 142 build? For example, run a diff on the source, build options, and build scripts. If the build settings are close enough, a comparison of the generated binaries might be a faster way to narrow things down (if the optimizations are different, then a binary comparison probably won't be useful).

You said previously that: "The procedure I followed was basically what is outlined here: http://insanum.com/blog/2010/06/08/how-to-build-opensolaris using the SunStudio 12 compilers for ON and 12u1 for lint."

Are these the same compiler versions Rich Lowe used? Maybe there is a compiler optimization bug. Rich Lowe's build readme doesn't tell us which compiler he used: http://genunix.org/dist/richlowe/README.txt

I suppose the easiest way for me to confirm if there is a regression or if my compiling is flawed is to just try compiling snv_142 using the same procedure and see if it works as well as Rich Lowe's copy or if it's slow like my other compilations. Chad

Another, older compilation guide: http://hub.opensolaris.org/bin/view/Community+Group+tools/building_opensolaris

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
Hi,

My bits were originally debug because I didn't know any better. I thought I had then recompiled without debug to test again, but I didn't realize until just now that the packages end up in a different directory (nightly vs. nightly-nd), so I believe that after compiling non-debug I just reinstalled the debug bits. I'm about to test again with an actual non-debug 142, and after that a non-debug 145, which just came out.

Thanks, Chad

On Wed, Jul 21, 2010 at 02:21:51AM -0400, Richard Lowe wrote: I built in the normal fashion, with the CBE compilers (cc: Sun C 5.9 SunOS_i386 Patch 124868-10 2009/04/30), and 12u1 lint. I'm not subscribed to zfs-discuss, but have you established whether the problematic build is DEBUG? (The bits I uploaded were non-DEBUG.) -- Rich

Haudy Kazemi wrote: Could it somehow not be compiling 64-bit support? -- Brent Jones

I thought about that, but it says when it boots up that it is 64-bit, and I'm able to run 64-bit binaries. I wonder if it's compiling for the wrong processor optimization, though? Maybe if it is missing some of the newer SSEx instructions, the zpool checksum checking is slowed down significantly? I don't know how to check for this, though, and it seems strange that it would be slowed down this much. I'd expect even a non-SSE binary to be able to calculate a few hundred MB of checksums per second on a 2.5+ GHz processor.

Chad, Would it be possible to do a closer comparison between Rich Lowe's fast 142 build and your slow 142 build? For example, run a diff on the source, build options, and build scripts. If the build settings are close enough, a comparison of the generated binaries might be a faster way to narrow things down (if the optimizations are different, then a binary comparison probably won't be useful). You said previously that: The procedure I followed was basically what is outlined here: http://insanum.com/blog/2010/06/08/how-to-build-opensolaris using the SunStudio 12 compilers for ON and 12u1 for lint.
Are these the same compiler versions Rich Lowe used? Maybe there is a compiler optimization bug. Rich Lowe's build readme doesn't tell us which compiler he used: http://genunix.org/dist/richlowe/README.txt

I suppose the easiest way for me to confirm if there is a regression or if my compiling is flawed is to just try compiling snv_142 using the same procedure and see if it works as well as Rich Lowe's copy or if it's slow like my other compilations. Chad

Another, older compilation guide: http://hub.opensolaris.org/bin/view/Community+Group+tools/building_opensolaris
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
It does seem to be faster now that I really installed the non-debug bits. I let it resume a scrub after reboot, and while it's not as fast as it usually is (280-300 MB/s vs. 500+), I assume it's just currently checking a part of the filesystem with smaller files, which reduces the speed, since it's well past the prior limitation. I tested 142 non-debug briefly, until the scrub reached at least 250 MB/s, and then booted into 145 non-debug, where I'm letting the scrub finish now. I'll test the Nexenta disc again to be sure it was slow, since I don't recall exactly how much time I gave the scrub in my prior tests to reach its normal speed, although I can't do that until this evening when I'm home again.

Chad

On Wed, Jul 21, 2010 at 09:44:42AM -0700, Chad Cantwell wrote: Hi, My bits were originally debug because I didn't know any better. I thought I had then recompiled without debug to test again, but I didn't realize until just now that the packages end up in a different directory (nightly vs. nightly-nd), so I believe that after compiling non-debug I just reinstalled the debug bits. I'm about to test again with an actual non-debug 142, and after that a non-debug 145, which just came out. Thanks, Chad

On Wed, Jul 21, 2010 at 02:21:51AM -0400, Richard Lowe wrote: I built in the normal fashion, with the CBE compilers (cc: Sun C 5.9 SunOS_i386 Patch 124868-10 2009/04/30), and 12u1 lint. I'm not subscribed to zfs-discuss, but have you established whether the problematic build is DEBUG? (The bits I uploaded were non-DEBUG.) -- Rich

Haudy Kazemi wrote: Could it somehow not be compiling 64-bit support? -- Brent Jones

I thought about that, but it says when it boots up that it is 64-bit, and I'm able to run 64-bit binaries. I wonder if it's compiling for the wrong processor optimization, though? Maybe if it is missing some of the newer SSEx instructions, the zpool checksum checking is slowed down significantly?
I don't know how to check for this, though, and it seems strange that it would be slowed down this much. I'd expect even a non-SSE binary to be able to calculate a few hundred MB of checksums per second on a 2.5+ GHz processor.

Chad, Would it be possible to do a closer comparison between Rich Lowe's fast 142 build and your slow 142 build? For example, run a diff on the source, build options, and build scripts. If the build settings are close enough, a comparison of the generated binaries might be a faster way to narrow things down (if the optimizations are different, then a binary comparison probably won't be useful). You said previously that: The procedure I followed was basically what is outlined here: http://insanum.com/blog/2010/06/08/how-to-build-opensolaris using the SunStudio 12 compilers for ON and 12u1 for lint.

Are these the same compiler versions Rich Lowe used? Maybe there is a compiler optimization bug. Rich Lowe's build readme doesn't tell us which compiler he used: http://genunix.org/dist/richlowe/README.txt

I suppose the easiest way for me to confirm if there is a regression or if my compiling is flawed is to just try compiling snv_142 using the same procedure and see if it works as well as Rich Lowe's copy or if it's slow like my other compilations. Chad

Another, older compilation guide: http://hub.opensolaris.org/bin/view/Community+Group+tools/building_opensolaris
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On Mon, Jul 19, 2010 at 07:01:54PM -0700, Chad Cantwell wrote:

On Tue, Jul 20, 2010 at 10:54:44AM +1000, James C. McPherson wrote: On 20/07/10 10:40 AM, Chad Cantwell wrote:

FYI, everyone, I have some more info here. In short, Rich Lowe's 142 works correctly (fast) on my hardware, while both my compilations (snv 143, snv 144) and also the Nexenta 3 RC2 kernel (134 with backports) are horribly slow.

I finally got around to trying Rich Lowe's snv 142 compilation in place of my own compilation of 143 (and later 144, not mentioned below), and unlike my own two compilations, his works very fast again on my same zpool (scrubbing average increased from the low 100s to over 400 MB/s within a few minutes of booting into this copy of 142).

I should note that since my original message, I also tried booting from a Nexenta Core 3.0 RC2 ISO after realizing it had zpool 26 support backported into 134, and it was in fact able to read my zpool despite the upgraded version. Running a scrub from the F2 shell on the Nexenta CD was also slow, just like the 143 and 144 that I compiled.

So, there seem to be two possibilities. Either (and this seems unlikely) there is a problem introduced post-142 which slows things down, and it occurred in 143 and 144 and was brought back to 134 with Nexenta's backports, or else (more likely) there is something different or wrong with how I'm compiling the kernel that makes the hardware not perform up to its specifications with a zpool, and possibly the Nexenta 3 RC2 ISO has the same problem as my own compilations.

So - what's your env file contents, which closed bins are you using, which crypto bits are you using, and what changeset is your own workspace synced with? James C. McPherson -- Oracle http://www.jmcp.homeunix.com/blog

The procedure I followed was basically what is outlined here: http://insanum.com/blog/2010/06/08/how-to-build-opensolaris using the SunStudio 12 compilers for ON and 12u1 for lint.
For each build (143, 144) I cloned the exact tag for that build, i.e.:

# hg clone ssh://a...@hg.opensolaris.org/hg/onnv/onnv-gate onnv-b144
# cd onnv-b144
# hg update onnv_144

Then I downloaded the corresponding closed and crypto bins from http://dlc.sun.com/osol/on/downloads/b143 or http://dlc.sun.com/osol/on/downloads/b144

The only environment variables I modified from the default opensolaris.sh file were the basic ones: GATE, CODEMGR_WS, STAFFER, and ON_CRYPTO_BINS, to point at my work directory for the build, my username, and the relevant crypto bin:

$ egrep -e '^GATE|^CODEMGR_WS|^STAFFER|^ON_CRYPTO_BINS' opensolaris.sh
GATE=onnv-b144; export GATE
CODEMGR_WS=/work/compiling/$GATE; export CODEMGR_WS
STAFFER=chad; export STAFFER
ON_CRYPTO_BINS=$CODEMGR_WS/on-crypto-latest.$MACH.tar.bz2

I suppose the easiest way for me to confirm if there is a regression or if my compiling is flawed is to just try compiling snv_142 using the same procedure and see if it works as well as Rich Lowe's copy or if it's slow like my other compilations. Chad

I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 143 and 144 compilations and with the Nexenta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression.

Chad
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
Yes, I think this might have been it. I missed the NIGHTLY_OPTIONS variable in opensolaris.sh, and I think it was compiling a debug build. I'm not sure what the ramifications of this are, or how much slower a debug build should be, but I'm recompiling a release build now, so hopefully all will be well.

Thanks, Chad

On Tue, Jul 20, 2010 at 08:39:42AM +0100, Robert Milkowski wrote: On 20/07/2010 07:59, Chad Cantwell wrote: I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 143 and 144 compilations and with the Nexenta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression.

Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com
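[For readers hitting the same trap: in ON's nightly(1) build tool, a "D" in NIGHTLY_OPTIONS requests a DEBUG build, and non-debug packages land in a separate nightly-nd directory. A quick grep of the env file shows which kind of build it will produce. The options string and path below are made-up examples, not the stock defaults.]

```shell
# Write an illustrative env-file fragment (placeholder path and options).
cat > /tmp/opensolaris.sh <<'EOF'
NIGHTLY_OPTIONS="-FnCDlmprt"; export NIGHTLY_OPTIONS
EOF

# Extract the options string and check for the DEBUG flag "D".
opts=$(sed -n 's/^NIGHTLY_OPTIONS="\([^"]*\)".*/\1/p' /tmp/opensolaris.sh)
case "$opts" in
  *D*) echo "DEBUG build requested" ;;      # sanity checks on, optimization off
  *)   echo "non-DEBUG build requested" ;;  # release bits
esac
# -> DEBUG build requested
```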
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
No, this wasn't it. A non-debug build with the same NIGHTLY_OPTIONS as Rich Lowe's 142 build is still very slow...

On Tue, Jul 20, 2010 at 09:52:10AM -0700, Chad Cantwell wrote: Yes, I think this might have been it. I missed the NIGHTLY_OPTIONS variable in opensolaris.sh, and I think it was compiling a debug build. I'm not sure what the ramifications of this are, or how much slower a debug build should be, but I'm recompiling a release build now, so hopefully all will be well. Thanks, Chad

On Tue, Jul 20, 2010 at 08:39:42AM +0100, Robert Milkowski wrote: On 20/07/2010 07:59, Chad Cantwell wrote: I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 143 and 144 compilations and with the Nexenta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression.

Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On Tue, Jul 20, 2010 at 10:45:58AM -0700, Brent Jones wrote: On Tue, Jul 20, 2010 at 10:29 AM, Chad Cantwell c...@iomail.org wrote: No, this wasn't it. A non-debug build with the same NIGHTLY_OPTIONS as Rich Lowe's 142 build is still very slow...

On Tue, Jul 20, 2010 at 09:52:10AM -0700, Chad Cantwell wrote: Yes, I think this might have been it. I missed the NIGHTLY_OPTIONS variable in opensolaris.sh, and I think it was compiling a debug build. I'm not sure what the ramifications of this are, or how much slower a debug build should be, but I'm recompiling a release build now, so hopefully all will be well. Thanks, Chad

On Tue, Jul 20, 2010 at 08:39:42AM +0100, Robert Milkowski wrote: On 20/07/2010 07:59, Chad Cantwell wrote: I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 143 and 144 compilations and with the Nexenta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression.

Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com

Could it somehow not be compiling 64-bit support? -- Brent Jones br...@servuhome.net

I thought about that, but it says when it boots up that it is 64-bit, and I'm able to run 64-bit binaries. I wonder if it's compiling for the wrong processor optimization, though? Maybe if it is missing some of the newer SSEx instructions, the zpool checksum checking is slowed down significantly? I don't know how to check for this, though, and it seems strange that it would be slowed down this much. I'd expect even a non-SSE binary to be able to calculate a few hundred MB of checksums per second on a 2.5+ GHz processor.

Chad
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
FYI, everyone, I have some more info here. In short, Rich Lowe's 142 works correctly (fast) on my hardware, while both my compilations (snv 143, snv 144) and also the Nexenta 3 RC2 kernel (134 with backports) are horribly slow.

I finally got around to trying Rich Lowe's snv 142 compilation in place of my own compilation of 143 (and later 144, not mentioned below), and unlike my own two compilations, his works very fast again on my same zpool (scrubbing average increased from the low 100s to over 400 MB/s within a few minutes of booting into this copy of 142).

I should note that since my original message, I also tried booting from a Nexenta Core 3.0 RC2 ISO after realizing it had zpool 26 support backported into 134, and it was in fact able to read my zpool despite the upgraded version. Running a scrub from the F2 shell on the Nexenta CD was also slow, just like the 143 and 144 that I compiled.

So, there seem to be two possibilities. Either (and this seems unlikely) there is a problem introduced post-142 which slows things down, and it occurred in 143 and 144 and was brought back to 134 with Nexenta's backports, or else (more likely) there is something different or wrong with how I'm compiling the kernel that makes the hardware not perform up to its specifications with a zpool, and possibly the Nexenta 3 RC2 ISO has the same problem as my own compilations.

Chad

On Tue, Jul 06, 2010 at 03:08:50PM -0700, Chad Cantwell wrote: Hi all, I've noticed something strange in the throughput of my zpool between different snv builds, and I'm not sure if it's an inherent difference in the build or a kernel parameter that differs between builds. I've set up two similar machines, and this happens with both of them. Each system has 16 2TB Samsung HD203WI drives (total), directly connected to two LSI 3081E-R 1068e cards with IT firmware, in one raidz3 vdev.
In both computers, after a fresh installation of snv 134, the throughput is a maximum of about 300 MB/s during a scrub or something like dd if=/dev/zero bs=1024k of=bigfile. If I bfu to snv 138, I then get throughput of about 700 MB/s with both a scrub and a single-thread dd. I assumed at first this was some sort of bug or regression in 134 that made it slow. However, I've now also tested, from the fresh 134 installation, compiling the OS/Net build 143 from the mercurial repository and booting into it, after which the dd throughput is still only about 300 MB/s, just like snv 134. The scrub throughput in 143 is even slower, rarely surpassing 150 MB/s. I wonder if the scrubbing being extra slow here is related to the additional statistics displayed during the scrub that didn't used to be shown.

Is there some kind of debug option that might be enabled in the 134 build and persist if I compile snv 143, which would be off if I installed a 138 through bfu? If not, it makes me think that the bfu to 138 is changing the configuration somewhere to make it faster, rather than fixing a bug or toggling a debug flag. Does anyone have any idea what might be happening? One thing I haven't tried is bfu'ing to 138, and from that faster-working snv 138 installing the snv 143 build, which may possibly create a 143 that performs faster if it's simply a configuration parameter. I'm not sure offhand if installing source-compiled ON builds from a bfu'd rpool is supported, although I suppose it's simple enough to try.

Thanks, Chad Cantwell
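[The single-thread write test described above is just a large sequential dd; a minimal sketch follows. The output path and size here are placeholders — on the real system of= would point at a file on the raidz3 pool and count= would be much larger. Note also that writing zeros can be misleading on a dataset with compression enabled, since zero blocks compress away.]

```shell
# Single-thread sequential write to gauge pool throughput
# (tiny placeholder size so it finishes instantly; scale count= up for real runs).
dd if=/dev/zero of=/tmp/bigfile bs=1024k count=16 2>/dev/null

# Confirm how much was written: 16 x 1 MiB = 16777216 bytes.
wc -c < /tmp/bigfile
```

Timing the dd run (or watching zpool iostat in another terminal) gives the MB/s figures quoted in this thread.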
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On Tue, Jul 20, 2010 at 10:54:44AM +1000, James C. McPherson wrote: On 20/07/10 10:40 AM, Chad Cantwell wrote:

FYI, everyone, I have some more info here. In short, Rich Lowe's 142 works correctly (fast) on my hardware, while both my compilations (snv 143, snv 144) and also the Nexenta 3 RC2 kernel (134 with backports) are horribly slow.

I finally got around to trying Rich Lowe's snv 142 compilation in place of my own compilation of 143 (and later 144, not mentioned below), and unlike my own two compilations, his works very fast again on my same zpool (scrubbing average increased from the low 100s to over 400 MB/s within a few minutes of booting into this copy of 142).

I should note that since my original message, I also tried booting from a Nexenta Core 3.0 RC2 ISO after realizing it had zpool 26 support backported into 134, and it was in fact able to read my zpool despite the upgraded version. Running a scrub from the F2 shell on the Nexenta CD was also slow, just like the 143 and 144 that I compiled.

So, there seem to be two possibilities. Either (and this seems unlikely) there is a problem introduced post-142 which slows things down, and it occurred in 143 and 144 and was brought back to 134 with Nexenta's backports, or else (more likely) there is something different or wrong with how I'm compiling the kernel that makes the hardware not perform up to its specifications with a zpool, and possibly the Nexenta 3 RC2 ISO has the same problem as my own compilations.

So - what's your env file contents, which closed bins are you using, which crypto bits are you using, and what changeset is your own workspace synced with? James C. McPherson -- Oracle http://www.jmcp.homeunix.com/blog

The procedure I followed was basically what is outlined here: http://insanum.com/blog/2010/06/08/how-to-build-opensolaris using the SunStudio 12 compilers for ON and 12u1 for lint.
For each build (143, 144) I cloned the exact tag for that build, i.e.:

# hg clone ssh://a...@hg.opensolaris.org/hg/onnv/onnv-gate onnv-b144
# cd onnv-b144
# hg update onnv_144

Then I downloaded the corresponding closed and crypto bins from http://dlc.sun.com/osol/on/downloads/b143 or http://dlc.sun.com/osol/on/downloads/b144

The only environment variables I modified from the default opensolaris.sh file were the basic ones: GATE, CODEMGR_WS, STAFFER, and ON_CRYPTO_BINS, to point at my work directory for the build, my username, and the relevant crypto bin:

$ egrep -e '^GATE|^CODEMGR_WS|^STAFFER|^ON_CRYPTO_BINS' opensolaris.sh
GATE=onnv-b144; export GATE
CODEMGR_WS=/work/compiling/$GATE; export CODEMGR_WS
STAFFER=chad; export STAFFER
ON_CRYPTO_BINS=$CODEMGR_WS/on-crypto-latest.$MACH.tar.bz2

I suppose the easiest way for me to confirm if there is a regression or if my compiling is flawed is to just try compiling snv_142 using the same procedure and see if it works as well as Rich Lowe's copy or if it's slow like my other compilations.

Chad
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On Mon, Jul 19, 2010 at 06:00:04PM -0700, Brent Jones wrote: On Mon, Jul 19, 2010 at 5:40 PM, Chad Cantwell c...@iomail.org wrote:

FYI, everyone, I have some more info here. In short, Rich Lowe's 142 works correctly (fast) on my hardware, while both my compilations (snv 143, snv 144) and also the Nexenta 3 RC2 kernel (134 with backports) are horribly slow.

I finally got around to trying Rich Lowe's snv 142 compilation in place of my own compilation of 143 (and later 144, not mentioned below), and unlike my own two compilations, his works very fast again on my same zpool (scrubbing average increased from the low 100s to over 400 MB/s within a few minutes of booting into this copy of 142).

I should note that since my original message, I also tried booting from a Nexenta Core 3.0 RC2 ISO after realizing it had zpool 26 support backported into 134, and it was in fact able to read my zpool despite the upgraded version. Running a scrub from the F2 shell on the Nexenta CD was also slow, just like the 143 and 144 that I compiled.

So, there seem to be two possibilities. Either (and this seems unlikely) there is a problem introduced post-142 which slows things down, and it occurred in 143 and 144 and was brought back to 134 with Nexenta's backports, or else (more likely) there is something different or wrong with how I'm compiling the kernel that makes the hardware not perform up to its specifications with a zpool, and possibly the Nexenta 3 RC2 ISO has the same problem as my own compilations.

Chad

On Tue, Jul 06, 2010 at 03:08:50PM -0700, Chad Cantwell wrote: Hi all, I've noticed something strange in the throughput of my zpool between different snv builds, and I'm not sure if it's an inherent difference in the build or a kernel parameter that differs between builds. I've set up two similar machines, and this happens with both of them.
Each system has 16 2TB Samsung HD203WI drives (total), directly connected to two LSI 3081E-R 1068e cards with IT firmware, in one raidz3 vdev. In both computers, after a fresh installation of snv 134, the throughput is a maximum of about 300 MB/s during a scrub or something like dd if=/dev/zero bs=1024k of=bigfile. If I bfu to snv 138, I then get throughput of about 700 MB/s with both a scrub and a single-thread dd. I assumed at first this was some sort of bug or regression in 134 that made it slow.

However, I've now also tested, from the fresh 134 installation, compiling the OS/Net build 143 from the mercurial repository and booting into it, after which the dd throughput is still only about 300 MB/s, just like snv 134. The scrub throughput in 143 is even slower, rarely surpassing 150 MB/s. I wonder if the scrubbing being extra slow here is related to the additional statistics displayed during the scrub that didn't used to be shown.

Is there some kind of debug option that might be enabled in the 134 build and persist if I compile snv 143, which would be off if I installed a 138 through bfu? If not, it makes me think that the bfu to 138 is changing the configuration somewhere to make it faster, rather than fixing a bug or toggling a debug flag. Does anyone have any idea what might be happening? One thing I haven't tried is bfu'ing to 138, and from that faster-working snv 138 installing the snv 143 build, which may possibly create a 143 that performs faster if it's simply a configuration parameter. I'm not sure offhand if installing source-compiled ON builds from a bfu'd rpool is supported, although I suppose it's simple enough to try.
Thanks, Chad Cantwell

I'm surprised you're even getting 400 MB/s on the fast configurations, with only 16 drives in a raidz3 configuration. To me, 16 drives in raidz3 (a single vdev) would do about 150 MB/s, as your slow speeds suggest. -- Brent Jones br...@servuhome.net

With which drives and controllers? For a single dd thread writing a large file to fill up a new zpool from /dev/zero, in this configuration I can sustain over 700 MB/s for the duration of the process, and can fill up the ~26T of usable space overnight. This is with two 8-port LSI 1068e controllers and no expanders. RAIDZ operates similarly to regular RAID, and you should get striped speeds for sequential access, minus any inefficiencies and processing time for the parity. 16 disks in raidz3 is 13 disks' worth of striping, so at ~700 MB/s I'm getting about 50% efficiency after the parity calculations etc., which is fine with me. I understand that some people need to have higher performance random I/O to many
[zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
Hi all,

I've noticed something strange in the throughput of my zpool between different snv builds, and I'm not sure if it's an inherent difference in the build or a kernel parameter that differs between builds. I've set up two similar machines, and this happens with both of them. Each system has 16 2TB Samsung HD203WI drives (total), directly connected to two LSI 3081E-R 1068e cards with IT firmware, in one raidz3 vdev.

In both computers, after a fresh installation of snv 134, the throughput is a maximum of about 300 MB/s during a scrub or something like dd if=/dev/zero bs=1024k of=bigfile. If I bfu to snv 138, I then get throughput of about 700 MB/s with both a scrub and a single-thread dd. I assumed at first this was some sort of bug or regression in 134 that made it slow. However, I've now also tested, from the fresh 134 installation, compiling the OS/Net build 143 from the mercurial repository and booting into it, after which the dd throughput is still only about 300 MB/s, just like snv 134. The scrub throughput in 143 is even slower, rarely surpassing 150 MB/s. I wonder if the scrubbing being extra slow here is related to the additional statistics displayed during the scrub that didn't used to be shown.

Is there some kind of debug option that might be enabled in the 134 build and persist if I compile snv 143, which would be off if I installed a 138 through bfu? If not, it makes me think that the bfu to 138 is changing the configuration somewhere to make it faster, rather than fixing a bug or toggling a debug flag. Does anyone have any idea what might be happening? One thing I haven't tried is bfu'ing to 138, and from that faster-working snv 138 installing the snv 143 build, which may possibly create a 143 that performs faster if it's simply a configuration parameter. I'm not sure offhand if installing source-compiled ON builds from a bfu'd rpool is supported, although I suppose it's simple enough to try.
Thanks, Chad Cantwell
Re: [zfs-discuss] mpt errors on snv 127
FYI to everyone: the Asus P5W64 motherboard previously in my opensolaris machine was the culprit, and not the general mpt issues. At the time the motherboard was originally put in that machine, there was not enough ZFS I/O load to trigger the problem, which led to the false impression that the hardware was fine. I'm using a 5400-chipset Xeon board now (Asus DSEB-GH), and my LSI cards are working perfectly again; over 2 hours of heavy I/O and no errors or warnings with snv 127 (with the P5W64/LSI combo on build 127 it would never run more than 15 minutes without warnings). I chose this board partly because it has PCI-X slots, and I thought those might be useful for AOC-SAT2-MV8 cards if I couldn't shake the mpt issues, but now that the mpt issues are gone I can continue with that controller if I want.

Thanks everyone for your help, Chad

On Sun, Dec 06, 2009 at 11:12:50PM -0800, Chad Cantwell wrote: Thanks for the info on the yukon driver. I realize too many variables makes things impossible to determine, but I had made these hardware changes a while back, and they seemed to work fine at the time. Since they aren't now, even in the older OpenSolaris builds (I've tried 2009.06 and 2008.11 now), the problem seems to be a hardware quirk, and the only way to narrow that down is to change hardware back until it works like it used to in at least the older snv builds. I've ruled out the ethernet controller. I'm leaning toward the current motherboard (Asus P5W64) not playing nicely with the LSI cards, but it will probably be several days until I get to the bottom of this, since it takes a while to test after making a change... Thanks, Chad

On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote: G'day Chad, the more swaptronics you partake in, the more difficult it is going to be for us (collectively) to figure out what is going wrong on your system.
Btw, since you're running a build past 124, you can use the yge driver instead of the yukonx (from Marvell) or myk (from Murayama-san) drivers. As another comment in this thread has mentioned, a full scrub can be a serious test of your hardware depending on how much data you've got to walk over. If you can keep the hardware variables to a minimum then clarity will be more achievable.

thankyou, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
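[Editor's note: James's point that a full scrub reads every allocated block, and so doubles as a hardware shakedown, can be checked from the shell. A minimal sketch; the pool name "tank" is a placeholder and the status text below is an illustrative sample, not output from this thread:]

```shell
# On a live host you would run:
#   zpool scrub tank      # start walking every allocated block
#   zpool status tank     # report scrub progress and per-device errors
# Here a sample status block is parsed to pull out the scrub line.
status_sample=' pool: tank
 state: ONLINE
 scrub: scrub in progress for 0h31m, 12.40% done, 3h42m to go'
printf '%s\n' "$status_sample" | awk -F': ' '/scrub:/ { print $2 }'
# -> scrub in progress for 0h31m, 12.40% done, 3h42m to go
```

A scrub that completes with zero checksum errors under full disk load is reasonable evidence the HBA/cabling path is sound.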
Re: [zfs-discuss] mpt errors on snv 127
Thanks for the info on the yukon driver. I realize too many variables make things impossible to determine, but I had made these hardware changes awhile back, and they seemed to work fine at the time. Since they aren't now, even in the older OpenSolaris (I've tried 2009.06 and 2008.11 now), the problem seems to be a hardware quirk, and the only way to narrow that down is to change hardware back until it works like it used to in at least the older snv builds. I've ruled out the ethernet controller. I'm leaning toward the current motherboard (Asus P5W64) not playing nicely with the LSI cards, but it will probably be several days until I get to the bottom of this since it takes a while to test after making a change... Thanks, Chad

On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote: Gday Chad, the more swaptronics you partake in, the more difficult it is going to be for us (collectively) to figure out what is going wrong on your system. Btw, since you're running a build past 124, you can use the yge driver instead of the yukonx (from Marvell) or myk (from Murayama-san) drivers. As another comment in this thread has mentioned, a full scrub can be a serious test of your hardware depending on how much data you've got to walk over. If you can keep the hardware variables to a minimum then clarity will be more achievable. thankyou, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
Hi all, Unfortunately for me, there does seem to be a hardware component to my problem. Although my rsync copied almost 4TB of data with no iostat errors after going back to OpenSolaris 2009.06, I/O on one of my mpt cards did eventually hang, with 6 disk lights on and 2 off, until rebooting. There are a few hardware changes made since the last time I did a full backup, so it's possible that whatever problem was introduced didn't happen frequently enough in low I/O usage for me to detect until now, when I was reinstalling and copying massive amounts of data back. The changes I had made since originally installing osol2009.06 several months ago are:

- stopped using the onboard Marvell Yukon2 ethernet driver (which used a 3rd party driver) in favor of an Intel 1000 PT dual port, which necessitated an extra PCI-e slot, prompting the following item:
- swapped motherboards between 2 machines (they were similar, though, with similar onboard hardware, and shouldn't have been a major change). Originally was an Asus P5Q Deluxe w/3 PCI-e slots, now is a slightly older Asus P5W64 w/4 PCI-e slots.
- the Intel 1000 PT dual port card has been aggregated as aggr0 since it was installed (the older Yukon2 was a basic interface)

The above changes were what was done awhile ago, before upgrading OpenSolaris to 127, and things seemed to be working fine for at least 2-3 months with rsync updating (never hung, or had a fatal zfs error, or lost access to data requiring a reboot). New changes since troubleshooting the snv 127 mpt issues:

- upgraded LSI 3081 firmware from 1.28.2 (or was it .02) to 1.29, the latest. If this turns out to be an issue, I do have the previous IT firmware that I was using before, which I can flash back.

Another, albeit unlikely, factor: when I originally copied all my data to my first opensolaris raidz2 pool, I didn't use rsync at all, I used netcat tar, and only set up rsync later for updates.
Perhaps the huge initial single rsync of the large tree does something strange that the original initial netcat tar copy did not (I know, unlikely, but I'm grasping at straws here to determine what has happened). I'll work on ruling out the potential sources of hardware problems before I report any more on the mpt issues, since my test case would probably confound things at this point. I am affected by the mpt bugs, since I would get the timeouts almost constantly in snv 127+, but since I'm also apparently affected by some other unknown hardware issue, my data on the mpt problems might lead people in the wrong direction at this point. I will first try to go back to the non-aggregated yukon ethernet, remove the intel dual port pci-e network adapter, then if the problem persists try half of my drives on each LSI controller individually to confirm if one controller has a problem the other does not, or one drive in one set is causing a new problem to a particular controller. I hope to have some kind of answer at that point and not have to resort to motherboard swapping again. Chad

On Thu, Dec 03, 2009 at 10:44:53PM -0800, Chad Cantwell wrote: I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and trying the itmpt driver which someone had said would work, and regardless my system would always freeze quite rapidly in snv 127 and 128a. Just to double check my hardware, I went back to the opensolaris 2009.06 release version, and everything is working fine. The system has been running a few hours and copied a lot of data and not had any trouble, mpt syslog events, or iostat errors. One thing I found interesting, and I don't know if it's significant or not, is that under the recent builds and under 2009.06, I had run echo '::interrupts' | mdb -k to check the interrupts used. (I don't have the printout handy for snv 127+, though). I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1.
In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) on the IRQ listing, whereas in opensolaris 2009.06, all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could have potentially been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic. I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed (this machine is mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems). Chad
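[Editor's note: the IRQ-sharing check Chad describes can be scripted. A hedged sketch: on a live system the table would come from `echo '::interrupts' | mdb -k`; the sample here is illustrative, and the column layout varies by build, so the field numbers are assumptions to adjust:]

```shell
# Flag any IRQ whose Share count is above 1, i.e. two drivers (such as
# mpt and e1000g) sitting on the same interrupt line.
sample='IRQ  Vect IPL Bus  Type CPU Share APIC/INT# Driver Name(s)
24   0x40 5   PCI  Fix  1   1     0x1/0x4   mpt#0
25   0x41 5   PCI  Fix  2   2     0x1/0x5   mpt#1, e1000g#0'
printf '%s\n' "$sample" | awk 'NR > 1 && $7 > 1 { print "IRQ " $1 " shared: " $9, $10 }'
# -> IRQ 25 shared: mpt#1, e1000g#0
```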
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
I was under the impression that the problem affecting most of us was introduced much later than b104, sometime between ~114 and ~118. When I first started using my LSI 3081 cards, they had the IR firmware on them, and it caused me all kinds of problems. The disks showed up but I couldn't write to them, I believe. Eventually I found that I needed the IT firmware for it to work properly, which is what I have used ever since, but maybe some builds do work with IR firmware? I remember, then, when I was originally trying to set them up with the IR firmware, OpenSolaris saw my two cards as one device, whereas with the IT firmware they were always mpt0 and mpt1. Could also be that IR works with one card but not well when two cards are combined... Chad

On Sat, Dec 05, 2009 at 02:47:55PM -0800, Calvin Morrow wrote: I found this thread after fighting the same problem in Nexenta, which uses the OpenSolaris kernel from b104. Thankfully, I think I have (for the moment) solved my problem. Background: I have an LSI 3081e-R (1068E based) adapter which experiences the same disconnected command timeout error under relatively light load. This card connects to a Supermicro chassis using 2 MiniSAS cables to redundant expanders that are attached to 18 SAS drives. The card ran the latest IT firmware (1.29?). This server is a new install, and even installing from the CD to two disks in a mirrored ZFS root would randomly cause the disconnect error. The system remained unresponsive until after a reboot. I tried the workarounds mentioned in this thread, namely using set mpt:mpt_enable_msi = 0 and set xpv_psm:xen_support_msi = -1 in /etc/system. Once I added those lines, the system never really became unresponsive, however there were partial read and partial write messages that littered dmesg. At one point there appeared to be a disconnect error (cannot confirm) that the system recovered from.
Eventually, I became desperate and flashed the IR (Integrated Raid) firmware over the top of the IT firmware. Since then, I have had no errors in dmesg of any kind. I even removed the workarounds from /etc/system and still have had no issues. The mpt driver is exceptionally quiet now. I'm interested to know if anyone who has a 1068E based card is having these problems using the IR firmware, or if they all seem to be IT (initiator target) related.
Re: [zfs-discuss] mpt errors on snv 127
I eventually performed a few more tests, adjusting some zfs tuning options which had no effect, and trying the itmpt driver which someone had said would work, and regardless my system would always freeze quite rapidly in snv 127 and 128a. Just to double check my hardware, I went back to the opensolaris 2009.06 release version, and everything is working fine. The system has been running a few hours and copied a lot of data and not had any trouble, mpt syslog events, or iostat errors. One thing I found interesting, and I don't know if it's significant or not, is that under the recent builds and under 2009.06, I had run echo '::interrupts' | mdb -k to check the interrupts used. (I don't have the printout handy for snv 127+, though). I have a dual port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) on the IRQ listing, whereas in opensolaris 2009.06, all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could have potentially been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic. 
I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed (this machine is mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems). Chad

On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote: To update everyone, I did a complete zfs scrub, and it generated no errors in iostat, and I have 4.8T of data on the filesystem so it was a fairly lengthy test. The machine also has exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again though, I'm sure it would generate errors and crash again. Chad

On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote: Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does. Chad

On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote: I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put set mpt:mpt_enable_msi = 0 now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so.
I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly) Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way. Chad On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote: Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ... Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request. Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30
Re: [zfs-discuss] mpt errors on snv 127
I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put set mpt:mpt_enable_msi = 0 now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way. Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote: Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ... Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the pci bus config space), then you've got about 1/2 the nails in the coffin hammered in. Then the failure to restart the IOC (io controller unit) == the rest of the lid hammered down. best regards, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
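[Editor's note: the /etc/system MSI workaround that recurs in this thread can be applied idempotently. A minimal sketch: CFG stands in for /etc/system (a local file is used so the steps are safe to demonstrate); on a real host, edit /etc/system itself and reboot for the tunable to take effect:]

```shell
# Append the MSI workaround once, keeping a backup for easy rollback.
CFG=${CFG:-./system.sample}          # would be /etc/system on a live host
touch "$CFG"
cp "$CFG" "$CFG.bak"                 # back up before editing kernel tunables
grep -q 'mpt_enable_msi' "$CFG" || \
    echo 'set mpt:mpt_enable_msi = 0' >> "$CFG"
grep 'mpt_enable_msi' "$CFG"         # verify; prints the line exactly once
# -> set mpt:mpt_enable_msi = 0
# then reboot (e.g. init 6) so the kernel rereads /etc/system
```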
Re: [zfs-discuss] mpt errors on snv 127
Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does. Chad

On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote: I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put set mpt:mpt_enable_msi = 0 now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way. Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote: Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ...
Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request. Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the pci bus config space), then you've got about 1/2 the nails in the coffin hammered in. Then the failure to restart the IOC (io controller unit) == the rest of the lid hammered down. best regards, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
First I tried just upgrading to b127, which had a few issues besides the mpt driver. After that I did a clean install of b127, but no, I don't have my osol2009.06 root still there. I wasn't sure how to install another copy and leave it there (I suspect it is possible, since I saw when doing upgrades it creates a second root environment, but my forte isn't Solaris so I just reformatted the root device).

On Tue, Dec 01, 2009 at 08:09:32AM -0500, Mark Johnson wrote: Chad Cantwell wrote: Hi, I was using for quite awhile OpenSolaris 2009.06 with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3. Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127 and it started failing? Did you keep the osol2009.06 BE around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)? MRJ
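[Editor's note: the "second root environment" Chad saw is an OpenSolaris boot environment, which Mark's BE question refers to. A hedged sketch, assuming b127-era beadm is available; the BE name is a placeholder, and this is an untested outline rather than a verified procedure:]

```shell
# Keep the old root around instead of reformatting: list the existing
# boot environments, then snapshot the current one before upgrading.
beadm list
beadm create osol-2009.06-backup
# A pkg image-update then creates and activates a new BE on its own;
# to fall back to the old root:
#   beadm activate osol-2009.06-backup && init 6
```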
Re: [zfs-discuss] mpt errors on snv 127
To update everyone, I did a complete zfs scrub, and it generated no errors in iostat, and I have 4.8T of data on the filesystem so it was a fairly lengthy test. The machine also has exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again, though, I'm sure it would generate errors and crash again. Chad

On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote: Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does. Chad

On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote: I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put set mpt:mpt_enable_msi = 0 now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way. Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C.
McPherson wrote: Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ... Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request. Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the pci bus config space), then you've got about 1/2 the nails in the coffin hammered in. Then the failure to restart the IOC (io controller unit) == the rest of the lid hammered down. best regards, James C.
McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
[zfs-discuss] mpt errors on snv 127
Hi, Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet. I have an x86_64 OpenSolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with 8 drives each. The LSI cards are flashed with IT firmware from Feb 2009 (I think, I can double check if it's important). The drives are Samsung HD154UI 1.5TB disks. I was using for quite awhile OpenSolaris 2009.06 with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T and this worked perfectly fine (no issues or device errors logged for several months, no hanging).

A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3. I hadn't known at the time about the current mpt issues, or I may have held off on upgrading. I installed Solaris Nevada build 127 from the DVD image. I then proceeded to set up a raidz3 pool with the same disks as before, of a slightly smaller size (obviously) than the former raidz2 pool. I started a moderately long-running and heavy load rsync to copy my data back to the pool from another host. Several times during the day (sometimes a couple times an hour, or it could go up to a few hours with no errors), I get several syslog errors and warnings about mpt, similar but not identical to what I've seen reported here by others. Also, iostat -en shows several hw and trn errors of varying amounts for all the drives (in OpenSolaris 2009.06 I never had any iostat errors).

After awhile the machine will hang in a variety of ways. The first time it was pingable, and I could authenticate through ssh but it would never spawn a shell. The second time it crashed it was unpingable from the network, and the display was black, although the numlock key was still toggling properly the numlock light on the console. Here's a sample of my errors.
I've included the complete series of errors from one timestamp, and a few lines from a subsequent series of errors a couple minutes later (if there's any other info I can provide or more things to test, just let me know. Thanks, --Chad):

Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 29 04:42:55 the-vault mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120200
Nov 29 04:42:55 the-vault scsi: [ID 243001 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
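[Editor's note: the iostat error counters Chad cites can be watched mechanically. A hedged sketch: the sample table is illustrative (not from this report), and on a live host the input would come from `iostat -en` directly:]

```shell
# Print devices with non-zero hard (h/w) or transport (trn) errors.
iostat_sample='  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 c0t0d0
    0   3   7  10 c1t2d0'
printf '%s\n' "$iostat_sample" | \
    awk 'NR > 2 && ($2 > 0 || $3 > 0) { print $5 ": hw=" $2 " trn=" $3 }'
# -> c1t2d0: hw=3 trn=7
```

Rising trn (transport) counts under load, with zero soft errors, are the pattern this thread associates with the mpt timeout bug rather than failing media.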
Re: [zfs-discuss] Workaround for mpt timeouts in snv_127
Hi, I just posted a summary of a similar issue I'm having with non-Sun hardware. For the record, it's in a Chenbro RM41416 chassis with 4 Chenbro SAS backplanes but no expanders (each backplane is 4 disks connected by SFF-8087 cable). Each of my LSI brand SAS3081E PCI-E cards is connected to two backplanes with 1m SFF-8087 (both ends) cables. For more details, if they are important, see my other post. I haven't tried the MSI workaround yet (although I'm not sure what MSI is) but from what I've read the workaround won't fix the issues in my case with non-Sun hardware. Thanks, Chad

On Tue, Dec 01, 2009 at 12:36:33PM +1000, James C. McPherson wrote: Hi all, I believe it's an accurate summary of the emails on this thread over the last 18 hours to say that

(1) disabling MSI support in xVM makes the problem go away
(2) disabling MSI support on bare metal when you only have disks internal to your host (no jbods), makes the problem go away (several reports of this)
(3) disabling MSI support on bare metal when you have a non-Sun jbod (and cables) does _not_ make the problem go away. (several reports of this)
(4) the problem is not seen with a Sun-branded jbod and cables (only one report of this)
(5) problem is seen with both mpt(7d) and itmpt(7d).
(6) mpt(7d) without MSI support is sloow.

For those who've been suffering this problem and who have non-Sun jbods, could you please let me know what model of jbod and cables (including length thereof) you have in your configuration. For those of you who have been running xVM without MSI support, could you please confirm whether the devices exhibiting the problem are internal to your host, or connected via jbod. And if via jbod, please confirm the model number and cables. Please note that Jianfei and I are not making assumptions about the root cause here, we're just trying to nail down specifics of what seems to be a likely cause. thankyou in advance, James C.
McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
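For readers finding this thread in the archives: the "disabling MSI support" workaround discussed above was applied via an /etc/system tunable followed by a reboot. This is a sketch only; the tunable name is an assumption based on contemporary list discussion, so verify it against the bug report for your specific build before using it.

```
* /etc/system fragment -- disable MSI interrupts for the mpt(7d) driver.
* The tunable name below is an assumption; confirm it for your build.
* A reboot is required for /etc/system changes to take effect.
set mpt:mpt_enable_msi = 0
```

Note item (6) in James's summary: mpt(7d) without MSI support runs noticeably slower, so this trades performance for stability while the root cause is investigated.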
Re: [zfs-discuss] mpt errors on snv 127
Hi,

Replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro 16 hot-swap bay case. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card).

Chad

On Tue, Dec 01, 2009 at 01:02:34PM +1000, James C. McPherson wrote:

Chad Cantwell wrote:
Hi, Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet. I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with 8 drives each.

Are these disks internal to your server's chassis, or external in a jbod? If in a jbod, which one? Also, which cables are you using?

Thank you,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
Hi,

The Chenbro chassis contains everything: the motherboard/CPU and the disks. As far as I know, the Chenbro backplanes are basically electrical jumpers that the LSI cards shouldn't be aware of. They pass the SATA signals through directly from the SFF-8087 cables to the disks.

Thanks,
Chad

On Tue, Dec 01, 2009 at 01:43:06PM +1000, James C. McPherson wrote:

Chad Cantwell wrote:
Hi, Replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card).

Hi Chad, thanks for the followup. Just to confirm: have you got this Chenbro chassis connected to the actual server chassis (where the cpu is), or do you have the cpu inside the Chenbro chassis?

Thank you,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 22:38:21 the-vault mpt_config_space_init failed
Nov 30 22:38:22 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 22:38:22 the-vault LSI PCI device (1000,) not supported.
Nov 30 22:38:22 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 22:38:22 the-vault mpt_config_space_init failed
Nov 30 22:38:46 the-vault sshd[636]: [ID 800047 auth.crit] monitor fatal: protocol error during kex, no DH_GEX_REQUEST: 254
Nov 30 22:38:46 the-vault sshd[637]: [ID 800047 auth.crit] fatal: Protocol error in privilege separation; expected packet type 254, got 20
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:23 the-vault mpt_send_handshake_msg task 3 failed
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:23 the-vault LSI PCI device (1000,) not supported.
Nov 30 23:11:23 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:23 the-vault mpt_config_space_init failed
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:25 the-vault LSI PCI device (1000,) not supported.
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:25 the-vault mpt_config_space_init failed
Nov 30 23:11:25 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 23:11:25 the-vault mpt_restart_ioc failed

(and that's the last message before I hit the reset button. The host was unpingable, and just moving the mouse around on the screen was extremely delayed.)

Nov 30 23:32:05 the-vault genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_127 64-bit
Nov 30 23:32:05 the-vault genunix: [ID 943908 kern.notice] Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.

Also, it says it resilvered some data; this is the first time I've seen any notes next to a device. Still no zpool errors, though.

# zpool status vault
  pool: vault
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Mon Nov 30 23:33:16 2009
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          raidz3-0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0
            c1t8d0  ONLINE       0     0     0
            c1t9d0  ONLINE       0     0     0
            c1t11d0 ONLINE       0     0     0
            c1t12d0 ONLINE       0     0     0
            c1t13d0 ONLINE       0     0     0
            c1t14d0 ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0  11.5K resilvered
            c2t6d0  ONLINE       0     0     0
            c2t7d0  ONLINE       0     0     0
            c2t8d0  ONLINE       0     0     0
            c2t9d0  ONLINE       0     0     0
            c2t10d0 ONLINE       0     0     0

errors: No known data errors
#

On Mon, Nov 30, 2009 at 06:46:13PM -0800, Chad Cantwell wrote:

Hi, Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet. I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with 8 drives each. The LSI cards are flashed with IT firmware from Feb 2009 (I think, I can double check if it's important).
The drives are Samsung HD154UI 1.5TB disks. For quite a while I used OpenSolaris 2009.06 with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3. I hadn't known at the time about the current mpt issues, or I might have held off on upgrading. I installed Solaris Nevada build 127 from the DVD image. I then proceeded to set up a raidz3 pool with the same disks as before, of a slightly smaller size (obviously) than the former raidz2 pool. I started a moderately long-running and heavy load rsync to copy my data back to the pool from another
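For readers unfamiliar with the raidz3 setup described above: a triple-parity pool is created with zpool's raidz3 vdev type. The sketch below uses hypothetical device names, not the actual sixteen c1t*/c2t* devices from this thread.

```shell
# Create a triple-parity (raidz3) pool named "vault".
# Device names here are hypothetical placeholders; substitute your own
# (list them with: format, or cfgadm -al).
zpool create vault raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

# Confirm the layout and parity level afterwards:
zpool status vault
```

A raidz3 vdev can lose up to three member disks without data loss, at the cost of three disks' worth of capacity, which is why the new pool came out slightly smaller than the former raidz2 pool on the same disks.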