Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Richard Elling
FWIW, we recommend disabling C-states in the BIOS for NexentaStor systems.
C-states are evil.
 -- richard

On Oct 31, 2011, at 9:46 PM, Lachlan Mulcahy wrote:

> Hi All,
> 
> 
> We did not have the latest firmware on the HBA - through a lot of pain I 
> managed to boot into an MS-DOS disk and run the firmware update. We're now 
> running the latest on this card from the LSI.com website. (both HBA BIOS and 
> Firmware)
> 
> No joy.. the system seized up again within a few hours of coming back up. 
> 
> Now trying another suggestion sent to me by a direct poster:
> 
> *   Recommendation from Sun (Oracle) to work around a bug:
> *   6958068 - Nehalem deeper C-states cause erratic scheduling behavior
> set idle_cpu_prefer_mwait = 0
> set idle_cpu_no_deep_c = 1
> 
> Was apparently the cause of a similar symptom for them and we are using 
> Nehalem.
> 
> At this point I'm running out of options, so it can't hurt to try it.
> 
> Regards,
> -- 
> Lachlan Mulcahy
> Senior DBA, 
> Marin Software Inc.
> San Francisco, USA
> 
> AU Mobile: +61 458 448 721
> US Mobile: +1 (415) 867 2839
> Office : +1 (415) 671 6080
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Lachlan Mulcahy
Hi All,


> We did not have the latest firmware on the HBA - through a lot of pain I
> managed to boot into an MS-DOS disk and run the firmware update. We're now
> running the latest on this card from the LSI.com website. (both HBA BIOS
> and Firmware)
>

No joy.. the system seized up again within a few hours of coming back up.

Now trying another suggestion sent to me by a direct poster:

*   Recommendation from Sun (Oracle) to work around a bug:
*   6958068 - Nehalem deeper C-states cause erratic scheduling behavior
set idle_cpu_prefer_mwait = 0
set idle_cpu_no_deep_c = 1

This was apparently the cause of a similar symptom for them, and we are using
Nehalem CPUs.

At this point I'm running out of options, so it can't hurt to try it.
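
For reference, a quick way to confirm the new settings actually took
effect after the reboot -- an illustrative check, assuming mdb -k is
available on the box:

# read the current values back from the running kernel
echo "idle_cpu_prefer_mwait/D" | mdb -k
echo "idle_cpu_no_deep_c/D" | mdb -k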

Regards,
-- 
Lachlan Mulcahy
Senior DBA,
Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office : +1 (415) 671 6080
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor relative performance of SAS over SATA drives

2011-10-31 Thread weiliam.hong

Thanks for the reply.

On 11/1/2011 11:03 AM, Richard Elling wrote:

On Oct 26, 2011, at 7:56 PM, weiliam.hong wrote:

Questions:
1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 remain consistent 
at >100 MB/s after 10-15 min?
2. Why does the SG SAS drive show only 70+ MB/s when the published figures 
referred to here are >100 MB/s?

Are the SAS drives multipathed? If so, do you have round-robin (default in most 
Solaris distros) or logical-block?


Physically, the SAS drives are not multipathed, as I connected them 
directly to the HBA. I also disabled multipathing via mpt_sas.conf.
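
For completeness, the disable is just the usual MPxIO knob in the
driver's conf file -- along these lines, shown as a sketch rather than
an exact copy of my config:

# /kernel/drv/mpt_sas.conf -- per-driver MPxIO disable, needs a reboot
mpxio-disable="yes";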


Regards,


3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver 
is used. Are SAS and SATA drives handled differently ?

Yes. SAS disks can be multipathed, SATA disks cannot.
  -- richard



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor relative performance of SAS over SATA drives

2011-10-31 Thread Richard Elling
On Oct 26, 2011, at 7:56 PM, weiliam.hong wrote:
> 
> Questions:
> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 remain consistent 
> at >100 MB/s after 10-15 min?
> 2. Why does the SG SAS drive show only 70+ MB/s when the published figures 
> referred to here are >100 MB/s?

Are the SAS drives multipathed? If so, do you have round-robin (default in most 
Solaris distros) or logical-block?
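
As an aside, an easy way to see which policy a multipathed LUN is
actually using (the device name below is just an example):

mpathadm show lu /dev/rdsk/c0t5000C50035062EC1d0s2

The system-wide default is the load-balance setting in
/kernel/drv/scsi_vhci.conf, e.g. load-balance="logical-block";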

> 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver 
> is used. Are SAS and SATA drives handled differently ?

Yes. SAS disks can be multipathed, SATA disks cannot.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Lachlan Mulcahy
Hi All/Marion,

A small update...


>> known to have lockups/timeouts when used with SAS expanders (disk
>> enclosures)
>> with incompatible firmware revisions, and/or with older mpt drivers.
>>
>
> I'll need to check that out -- I'm 90% sure that these are fresh out of
> box HBAs.
>
> Will try an upgrade there and see if we get any joy there...
>

We did not have the latest firmware on the HBA - through a lot of pain I
managed to boot into an MS-DOS disk and run the firmware update. We're now
running the latest on this card from the LSI.com website. (both HBA BIOS
and Firmware)


>> The MD1220 is a 6Gbit/sec device.  You may be better off with a matching
>> HBA  -- Dell has certainly told us the MD1200-series is not intended for
>> use with the 3Gbit/sec HBA's.  We're doing fine with the LSI SAS 9200-8e,
>> for example, when connecting to Dell MD1200's with the 2TB "nearline SAS"
>> disk drives.
>>
>
> I was aware the MD1220 is a 6G device, but I figured that since our IO
> throughput doesn't actually come close to saturating 3Gbit/sec that it
> would just operate at the lower speed and be OK. I guess it is something to
> look at if I run out of other options...
>

This was my mistake - this particular system has MD1120s attached to it. We
have a mix of MD1220s and MD1120s, since we've been with Dell since the
MD1120s were the current model.

Just kicked off the system running with the same logging as before with
this new firmware, so I'll see if this goes any better.

Regards,
-- 
Lachlan Mulcahy
Senior DBA,
Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office : +1 (415) 671 6080
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Lachlan Mulcahy
Hi Marion,

Thanks for your swift reply!

> Have you got the latest firmware on your LSI 1068E HBA's?  These have been
> known to have lockups/timeouts when used with SAS expanders (disk
> enclosures)
> with incompatible firmware revisions, and/or with older mpt drivers.
>

I'll need to check that out -- I'm 90% sure that these are fresh out of box
HBAs.

Will try an upgrade there and see if we get any joy there...


> The MD1220 is a 6Gbit/sec device.  You may be better off with a matching
> HBA  -- Dell has certainly told us the MD1200-series is not intended for
> use with the 3Gbit/sec HBA's.  We're doing fine with the LSI SAS 9200-8e,
> for example, when connecting to Dell MD1200's with the 2TB "nearline SAS"
> disk drives.
>

I was aware the MD1220 is a 6G device, but I figured that since our IO
throughput doesn't actually come close to saturating 3Gbit/sec that it
would just operate at the lower speed and be OK. I guess it is something to
look at if I run out of other options...


> Last, are you sure it's memory-related?  You might keep an eye on
> "arcstat.pl" output and see what the ARC sizes look like just prior to
> lockup.  Also,
> maybe you can look up instructions on how to force a crash dump when the
> system hangs -- one of the experts around here could tell a lot from a
> crash dump file.
>

I'm starting to doubt that it is a memory issue now -- especially since I
now have some results from my latest "test"...

output of arcstat.pl looked like this just prior to the lock up:

time      arcsz  c     mh%  mhit  hit%  hits  l2hit%  l2hits
19:57:36  24G    24G   94   161   61    194   1       1
19:57:41  24G    24G   96   174   62    213   0       0
19:57:46  23G    24G   94   161   62    192   1       1
19:57:51  24G    24G   96   169   63    205   0       0
19:57:56  24G    24G   95   169   61    206   0       0

^-- This is the very last line printed...

I actually discovered and rebooted the machine via DRAC at around 20:44, so
it had been in its bad state for around 1 hour.

Some snippets from the output some 20 minutes earlier show the point at
which the arcsz grew to reach the maximum:

time      arcsz  c     mh%  mhit  hit%  hits  l2hit%  l2hits
19:36:45  21G    24G   95   152   58    177   0       0
19:37:00  22G    24G   95   156   57    182   0       0
19:37:15  22G    24G   95   159   59    185   0       0
19:37:30  23G    24G   94   153   58    178   0       0
19:37:45  23G    24G   95   169   59    195   0       0
19:38:00  24G    24G   95   160   59    187   0       0
19:38:25  24G    24G   96   151   58    177   0       0

So it seems that arcsz reaching the 24G maximum wasn't necessarily to
blame, since the system operated for a good 20 minutes in this state.

I was also logging "vmstat 5" prior to the crash (though I forgot to
include some timestamps in my output) and these are the final lines
recorded in that log:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 25885248 18012208 71 2090 0 0 0 0 0  0  0  0 22 17008 210267 30229  1  5 94
 0 0 0 25884764 18001848 71 2044 0 0 0 0 0  0  0  0 25 14846 151228 25911  1  5 94
 0 0 0 25884208 17991876 71 2053 0 0 0 0 0  0  0  0  8 16343 185416 28946  1  5 93

So it seems there was some 17-18G free in the system when the lock up
occurred. Curious...

I was also capturing some arc info from mdb -k  and the output prior to the
lock up was...

Monday, October 31, 2011 07:57:51 PM UTC
arc_no_grow   = 0
arc_tempreserve   = 0 MB
arc_meta_used =  4621 MB
arc_meta_limit= 20480 MB
arc_meta_max  =  4732 MB
Monday, October 31, 2011 07:57:56 PM UTC
arc_no_grow   = 0
arc_tempreserve   = 0 MB
arc_meta_used =  4622 MB
arc_meta_limit= 20480 MB
arc_meta_max  =  4732 MB

Looks like metadata was not primarily responsible for consuming all of that
24G of ARC in arcstat.pl output...

Also, there seems to be nothing interesting in /var/adm/messages leading up
to my rebooting:

Oct 31 18:42:57 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency
error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:44:01 mslvstdp02r last message repeated 1 time
Oct 31 18:45:05 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency
error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:46:09 mslvstdp02r last message repeated 1 time
Oct 31 18:47:23 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency
error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:06:13 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency
error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:09:27 mslvstdp02r last message repeated 4 times
Oct 31 19:25:04 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency
error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:28:17 mslvstdp02r last m

Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Marion Hakanson
lmulc...@marinsoftware.com said:
> . . .
> The MySQL server is:
> Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
> each - 600GB 10k RPM SAS Drives Storage Controller: LSI, Inc. 1068E (JBOD)
> 
> I have also seen similar symptoms on systems with MD1000 disk arrays
> containing 2TB 7200RPM SATA drives.
> 
> The only thing of note that seems to show up in the /var/adm/messages file on
> this MySQL server is:
> 
> Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/
> pci8086,3410@9/pci1000,3080@0 (mpt0): Oct 31 18:24:51 mslvstdp02r mpt
> request inquiry page 0x89 for SATA target:58 failed! Oc
> . . .

Have you got the latest firmware on your LSI 1068E HBA's?  These have been
known to have lockups/timeouts when used with SAS expanders (disk enclosures)
with incompatible firmware revisions, and/or with older mpt drivers.

The MD1220 is a 6Gbit/sec device.  You may be better off with a matching
HBA  -- Dell has certainly told us the MD1200-series is not intended for
use with the 3Gbit/sec HBA's.  We're doing fine with the LSI SAS 9200-8e,
for example, when connecting to Dell MD1200's with the 2TB "nearline SAS"
disk drives.

Last, are you sure it's memory-related?  You might keep an eye on "arcstat.pl"
output and see what the ARC sizes look like just prior to lockup.  Also,
maybe you can look up instructions on how to force a crash dump when the
system hangs -- one of the experts around here could tell a lot from a
crash dump file.
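
For a hard hang, one commonly cited approach -- only a sketch, and the
module name differs between pcplusmp and apix, so check your platform
docs -- is to arm a panic on NMI and then trigger the NMI from the
DRAC/service processor:

# /etc/system -- panic (and therefore take a crash dump) on NMI
set pcplusmp:apic_panic_on_nmi = 1

# confirm a dump device and savecore directory are configured
dumpadm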

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?

2011-10-31 Thread Lachlan Mulcahy
Hi Folks,

I have been having issues with Solaris kernel based systems "locking up"
and am wondering if anyone else has observed a similar symptom before.

Some information/background...

Systems the symptom has presented on:  NFS server (Nexenta Core 3.01) and a
MySQL Server (Sol 11 Express).

The issue presents itself as almost total unresponsiveness -- I cannot SSH to
the host any longer, and access on the local console (via the Dell Remote
Access Console) is also unresponsive.

The only case in which I have seen some level of responsiveness is the
MySQL server... I was able to connect to the server and issue extremely
basic commands like SHOW PROCESSLIST -- anything else would just hang.

I feel like this could be explained by the fact that MySQL keeps a thread
cache (no need to allocate memory for a new thread on an incoming connection)
and SHOW PROCESSLIST can be served almost entirely from already-allocated
memory structures.

The NFS server has 48G physical memory and no specifically tuned ZFS
settings in /etc/system.

The MySQL server has 80G physical memory, and I have tried a variety of ZFS
tuning settings -- this is now the system that I am primarily focused on
troubleshooting...

The primary cache for the MySQL data zpool is set to metadata only (InnoDB
has its own buffer pool for data) and I have prefetch disabled, since
InnoDB also does its own prefetching...
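
That cache policy is just the standard dataset property -- something
along these lines, with a made-up dataset name:

zfs set primarycache=metadata tank/mysql-data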

Originally I had limited the ARC to 4G (to allow most memory to MySQL),
and that is the configuration under which the lock up was first observed.

I then retuned the server, thinking I wasn't allowing ZFS enough breathing
room -- I didn't realise how much memory metadata can really consume for a
20TB zpool!

So I removed the ARC limit and set the InnoDB buffer pool to 54G, down from
the previous setting of 64G... This should leave about 26G for the kernel and
ZFS.

The server ran fine for a few days, but then the symptom showed up again...
I rebooted the machine and, interestingly, while MySQL was doing crash
recovery the system locked up yet again!

Hardware-wise, we are using mostly Dell gear.

The MySQL server is:

Dell R710 / 80G Memory with two daisy chained MD1220 disk arrays - 22 Disks
each - 600GB 10k RPM SAS Drives
Storage Controller: LSI, Inc. 1068E (JBOD)

I have also seen similar symptoms on systems with MD1000 disk arrays
containing 2TB 7200RPM SATA drives.

The only thing of note that seems to show up in the /var/adm/messages file
on this MySQL server is:

Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0
,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA
target:58 failed!
Oct 31 18:24:52 mslvstdp02r scsi: [ID 583861 kern.info] ses0 at mpt0:
unit-address 58,0: target 58 lun 0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 936769 kern.info] ses0 is /pci@0
,0/pci8086,3410@9/pci1000,3080@0/ses@58,0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0
,0/pci8086,3410@9/pci1000,3080@0/ses@58,0 (ses0) online
Oct 31 18:24:52 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0
,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:52 mslvstdp02r mpt request inquiry page 0x89 for SATA
target:59 failed!
Oct 31 18:24:53 mslvstdp02r scsi: [ID 583861 kern.info] ses1 at mpt0:
unit-address 59,0: target 59 lun 0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 936769 kern.info] ses1 is /pci@0
,0/pci8086,3410@9/pci1000,3080@0/ses@59,0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0
,0/pci8086,3410@9/pci1000,3080@0/ses@59,0 (ses1) online


I'm thinking that the issue is memory-related, so the current test I am
running is:

ZFS tuneables:

/etc/system:

# Limit the amount of memory the ARC cache will use
# See this link:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
# Limit to 24G
set zfs:zfs_arc_max = 25769803776
# Limit meta data to 20GB
set zfs:zfs_arc_meta_limit = 21474836480
# Disable ZFS prefetch - InnoDB Does its own
set zfs:zfs_prefetch_disable = 1
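
As a sanity check after reboot (not captured in my logs), the effective
limits can be read back from the live kernel with something like:

echo "::arc" | mdb -k | egrep "c_max|arc_meta_limit"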


MySQL memory: Set the InnoDB buffer pool size to 44G (down another 10G from
54G). That should allow 44+24=68G for MySQL and the ARC, and 12G for anything
else that I haven't considered...

I am using arcstat.pl to collect/write stats on arc size, hit ratio,
requests, etc. to a file every 5 seconds, and vmstat also every 5 seconds.
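
Roughly, the collection amounts to something like this -- the log paths
are illustrative:

nohup arcstat.pl 5 >> /var/tmp/arcstat.log 2>&1 &
nohup vmstat 5 >> /var/tmp/vmstat.log 2>&1 &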

I'm hoping that, should the issue present itself again, I can find a
possible cause, but I'm really concerned about this issue - we want to make
use of ZFS in production, and these seemingly inexplicable lock ups are not
filling us with confidence :(

Has anyone seen similar things before and do you have any suggestions for
what else I should consider looking at?

Thanks and Regards,
-- 
Lachlan Mulcahy
Senior DBA,
Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office : +1 (415) 671 6080
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
htt

Re: [zfs-discuss] Log disk with all ssd pool?

2011-10-31 Thread Karl Rossing

On 10/28/2011 01:04 AM, Mark Wolek wrote:
> before the forum closed.

Did I miss something?

Karl





CONFIDENTIALITY NOTICE:  This communication (including all attachments) is
confidential and is intended for the use of the named addressee(s) only and
may contain information that is private, confidential, privileged, and
exempt from disclosure under law.  All rights to privilege are expressly
claimed and reserved and are not waived.  Any use, dissemination,
distribution, copying or disclosure of this message and any attachments, in
whole or in part, by anyone other than the intended recipient(s) is strictly
prohibited.  If you have received this communication in error, please notify
the sender immediately, delete this communication from all data storage
devices and destroy all hard copies.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor relative performance of SAS over SATA drives

2011-10-31 Thread weiliam.hong

Thanks for the reply.

Some background: the server is freshly installed, and the pools were newly
created right before running the tests.


Some comments below

On 10/31/2011 10:33 PM, Paul Kraus wrote:

A couple points in line below ...

On Wed, Oct 26, 2011 at 10:56 PM, weiliam.hong  wrote:


I have a fresh installation of OI151a:
- SM X8DTH, 12GB RAM, LSI 9211-8i (latest IT-mode firmware)
- pool_A : SG ES.2 Constellation (SAS)
- pool_B : WD RE4 (SATA)
- no settings in /etc/system
Load generation via 2 concurrent dd streams:
--
dd if=/dev/zero of=/pool_A/bigfile bs=1024k count=100
dd if=/dev/zero of=/pool_B/bigfile bs=1024k count=100

dd generates "straight line" data, all sequential.

yes.

                             capacity     operations    bandwidth
pool                       alloc  free   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----
pool_A                     15.5G  2.70T      0     50      0  6.29M
  mirror                   15.5G  2.70T      0     50      0  6.29M
    c7t5000C50035062EC1d0      -      -      0     62      0  7.76M
    c8t5000C50034C03759d0      -      -      0     50      0  6.29M
-------------------------  -----  -----  -----  -----  -----  -----
pool_B                     28.0G  1.79T      0  1.07K      0   123M
  mirror                   28.0G  1.79T      0  1.07K      0   123M
    c1t50014EE057FCD628d0      -      -      0  1.02K      0   123M
    c2t50014EE6ABB89957d0      -      -      0  1.02K      0   123M

What does `iostat -xnM c7t5000C50035062EC1d0 c8t5000C50034C03759d0
c1t50014EE057FCD628d0 c2t50014EE6ABB89957d0 1` show ? That will give
you much more insight into the OS <-> drive interface.

iostat numbers are similar. I will try to get the figures; it is a bit hard
now as the hardware has been taken off my hands.

What does `fsstat /pool_A /pool_B 1` show ? That will give you much
more insight into the application <-> filesystem interface. In this
case "application" == "dd".

In my opinion, `zpool iostat -v` is somewhat limited in what you can
learn from it. The only thing I use it for these days is to see
distribution of data and I/O between vdevs.


Questions:
1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 remain consistent
at >100 MB/s after 10-15 min?

Something changes to slow them down ? Sorry for the obvious retort :-)
See what iostat has to say. If the %b column is climbing, then you are
slowly saturating the drives themselves, for example.

There is no other workload or user on this system. The system is
freshly installed and booted, and the pools newly created.

2. Why does the SG SAS drive show only 70+ MB/s when the published figures
referred to here are >100 MB/s?

"published" where ?

http://www.seagate.com/www/en-au/products/enterprise-hard-drives/constellation-es/constellation-es-2/#tTabContentSpecifications



  What does a "dd" to the device itself (no ZFS, no
FS at all) show ? For example, `dd if=/dev/zero
of=/dev/dsk/c7t5000C50035062EC1d0s0 bs=1024k count=100` (after you
destroy the zpool and use format to create an s0 of the entire disk).
This will test the device driver / HBA / drive with no FS or volume
manager involved. Use iostat to watch the OS<->  drive interface.

Perhaps the test below is useful to understand the observation.

*dd test on slice 0*
dd if=/dev/zero of=/dev/rdsk/c1t5000C50035062EC1d0s0 bs=1024k

                    extended device statistics
    r/s    w/s   kr/s     kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0  155.4    0.0 159129.7   0.0   1.0     0.0     6.3   0  97  c1
    0.0  155.4    0.0 159129.7   0.0   1.0     0.0     6.3   0  97  c1t5000C50035062EC1d0  <== this is best case


*dd test on slice 6*
dd if=/dev/zero of=/dev/rdsk/c1t5000C50035062EC1d0s6 bs=1024k

                    extended device statistics
    r/s    w/s   kr/s     kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0   21.4    0.0  21913.6   0.0   1.0     0.0    46.6   0 100  c1
    0.0   21.4    0.0  21913.6   0.0   1.0     0.0    46.6   0 100  c1t5000C50035062EC1d0  <== only 20+ MB/s !!!


*Partition table info*

Part         Tag   Flag   First Sector        Size   Last Sector
  0          usr    wm              256    100.00GB     209715455
  1   unassigned    wm                0           0             0
  2   unassigned    wm                0           0             0
  3   unassigned    wm                0           0             0
  4   unassigned    wm                0           0             0
  5   unassigned    wm                0           0             0
  6          usr    wm       5650801295    100.00GB    5860516749
  8     reserved    wm       5860516751      8.00MB    5860533134

Referring to pg 18 of
http://www.seagate.com/staticfiles/support/docs/manual/enterprise/Constellation%203_5%20in/100628615f.pdf
the transfer rate is supposed to range from 68 - 155 MB/s. Why are the
inner cylinders only showing 20+ MB/s ? Am I testing or understanding
this wrongly ?





3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver
is used. Are SAS and SATA drives handled differently ?

Re: [zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level

2011-10-31 Thread Paul Kraus
On Sat, Oct 29, 2011 at 1:57 PM, Jim Klimov  wrote:

>  I am catching up with some 500 posts that I skipped this
> summer, and came up with a new question. In short, is it
> possible to add "restartability" to ZFS SEND, for example
> by adding artificial snapshots (of configurable increment
> size) into already existing datasets [too large to be
> zfs-sent successfully as one chunk of stream data]?

 We addressed this by decreasing our snapshot interval from 1 day
to 1 hour. We rarely have a snapshot bigger than a few GB now. I keep
meaning to put together a snapshot script that takes a new snapshot
when the amount of changed data increases to a certain point (for
example, take a snapshot whenever the snapshot would contain 250 MB of
data). Not enough round tuits with all the other broken stuff to fix
:-(
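
A minimal cron-driven sketch of that idea -- assuming a ZFS version that
supports the per-dataset "written" property (bytes changed since the
last snapshot), and with a made-up dataset name:

#!/bin/sh
FS=tank/data
THRESHOLD=262144000                        # 250 MB in bytes
WRITTEN=`zfs get -Hp -o value written $FS` # bytes written since last snap
if [ "$WRITTEN" -ge "$THRESHOLD" ]; then
    zfs snapshot $FS@auto-`date +%Y%m%d-%H%M%S`
fi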

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Poor relative performance of SAS over SATA drives

2011-10-31 Thread Paul Kraus
A couple points in line below ...

On Wed, Oct 26, 2011 at 10:56 PM, weiliam.hong  wrote:

> I have a fresh installation of OI151a:
> - SM X8DTH, 12GB RAM, LSI 9211-8i (latest IT-mode firmware)
> - pool_A : SG ES.2 Constellation (SAS)
> - pool_B : WD RE4 (SATA)
> - no settings in /etc/system

> Load generation via 2 concurrent dd streams:
> --
> dd if=/dev/zero of=/pool_A/bigfile bs=1024k count=100
> dd if=/dev/zero of=/pool_B/bigfile bs=1024k count=100

dd generates "straight line" data, all sequential.

>                              capacity     operations    bandwidth
> pool                       alloc  free   read  write   read  write
> -------------------------  -----  -----  -----  -----  -----  -----
> pool_A                     15.5G  2.70T      0     50      0  6.29M
>   mirror                   15.5G  2.70T      0     50      0  6.29M
>     c7t5000C50035062EC1d0      -      -      0     62      0  7.76M
>     c8t5000C50034C03759d0      -      -      0     50      0  6.29M
> -------------------------  -----  -----  -----  -----  -----  -----
> pool_B                     28.0G  1.79T      0  1.07K      0   123M
>   mirror                   28.0G  1.79T      0  1.07K      0   123M
>     c1t50014EE057FCD628d0      -      -      0  1.02K      0   123M
>     c2t50014EE6ABB89957d0      -      -      0  1.02K      0   123M

What does `iostat -xnM c7t5000C50035062EC1d0 c8t5000C50034C03759d0
c1t50014EE057FCD628d0 c2t50014EE6ABB89957d0 1` show ? That will give
you much more insight into the OS <-> drive interface.

What does `fsstat /pool_A /pool_B 1` show ? That will give you much
more insight into the application <-> filesystem interface. In this
case "application" == "dd".

In my opinion, `zpool iostat -v` is somewhat limited in what you can
learn from it. The only thing I use it for these days is to see
distribution of data and I/O between vdevs.

> Questions:
> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 remain consistent
> at >100 MB/s after 10-15 min?

Something changes to slow them down ? Sorry for the obvious retort :-)
See what iostat has to say. If the %b column is climbing, then you are
slowly saturating the drives themselves, for example.

> 2. Why does the SG SAS drive show only 70+ MB/s when the published figures
> referred to here are >100 MB/s?

"published" where ? What does a "dd" to the device itself (no ZFS, no
FS at all) show ? For example, `dd if=/dev/zero
of=/dev/dsk/c7t5000C50035062EC1d0s0 bs=1024k count=100` (after you
destroy the zpool and use format to create an s0 of the entire disk).
This will test the device driver / HBA / drive with no FS or volume
manager involved. Use iostat to watch the OS <-> drive interface.

> 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas
> driver is used. Are SAS and SATA drives handled differently ?

I assume there are (at least) four ports on the HBA ? I assume this
from the c7, c8, c1, c2 device names. That means that the drives
should _not_ be affecting each other. As another poster mentioned, the
behavior of the interface chip may change based on which drives are
seeing I/O, but I doubt that would be this big of a factor.

> This is a test server, so any ideas to try and help me understand greatly
> appreciated.

What do real benchmarks (iozone, filebench, orion) show ?

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug

2011-10-31 Thread Paul Kraus
On Mon, Oct 31, 2011 at 9:07 AM, Jim Klimov  wrote:
> 2011-10-31 16:28, Paul Kraus wrote:

>> Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of
>> RAM
>> to destroy the problem snapshot). I had not yet posted a summary as we
>> are still working through the overall problem (we tripped over this on
>> the replica, now we are working on it on the production copy).
>
> Good for you ;)
> Does Oracle loan such systems free to support their own foul-ups?
> Or do you have to pay a lease anyway? ;)

If you are paying for a support contract, _demand_ what is needed
to fix the problem. If you are not paying for support, well, then you
are on your own (as I believe the license says).

Maybe I've been in this business longer than many of the folks
here, but I both expect software to have bugs and I do NOT expect
commercial software vendors to provide fixes for free.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug

2011-10-31 Thread Jim Klimov

2011-10-31 16:28, Paul Kraus wrote:

 How big is / was the snapshot and dataset ? I am dealing with a 7
TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM.


I had a smaller-scale problem, with datasets and snapshots sized
several hundred GB, but on an 8Gb RAM system. So proportionally
it seems similar ;)

I have deduped data on the system, which adds to the strain of
dataset removal. The plan was to save some archive data there,
with few to no removals planned. But during testing of different
dataset layout hierarchies, things got out of hand ;)

I've also had an approx. 4TB dataset to destroy (a volume where
I kept another pool), but armed with the knowledge of how things
are expected to fail, I did its cleanup in small steps and saw very
few (perhaps no?) hangs while evacuating the data to the top-level
pool (which contained this volume).


Oracle has provided a loaner system with 128 GB RAM and it took 75 GB of RAM
to destroy the problem snapshot). I had not yet posted a summary as we
are still working through the overall problem (we tripped over this on
the replica, now we are working on it on the production copy).



Good for you ;)
Does Oracle loan such systems free to support their own foul-ups?
Or do you have to pay a lease anyway? ;)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug

2011-10-31 Thread Paul Kraus
On Sun, Oct 30, 2011 at 5:13 PM, Jim Klimov  wrote:
>>     I know there was (is ?) a bug where a zfs destroy of a large
>> snapshot would run a system out of kernel memory, but searching the

> Symptoms are like what you've described, including the huge scanrate
> just before the system dies (becomes unresponsive). Also if you try running
> with "vmstat 1" you can see that in the last few seconds of
> uptime the system would go from several hundred free MBs (or even
> over a GB free RAM) down to under 32 MB very quickly - consuming
> hundreds of MBs per second.

    Those are the traditional symptoms of a Solaris kernel memory bug :-)

> Unlike your system, my pool started with ZFSv28 (oi_148a), so any
> bugfixes and on-disk layout fixes relevant for ZFSv26 patches are
> in place already.

Ahhh, but jumping to the end...

> In my case I saw that between reboots and import attempts this
> counter went down by some 3 million blocks every uptime, and
> after a couple of stressful weeks the destroyed dataset was gone
> and the pool just worked on and on.

So your pool does have the fix. With zpool 22 NO PROGRESS is made
at all with each boot-import-hang cycle. I have an mdb command that I
got from Oracle support to determine the size of the snapshot that is
being destroyed. The bug in 22 is that a snapshot destroy is committed
as a single TXG. In 26 this is fixed (I assume there are on-disk
checkpoints to permit a snapshot to be destroyed across multiple TXGs).

How big is / was the snapshot and dataset ? I am dealing with a 7
TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM. Oracle
has provided a loaner system with 128 GB RAM and it took 75 GB of RAM
to destroy the problem snapshot). I had not yet posted a summary as we
are still working through the overall problem (we tripped over this on
the replica, now we are working on it on the production copy).

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss