Dear All,

I am again facing problems with COMSTAR iSCSI, this time with a fresh install of snv_134. I do not believe it is a hardware or performance problem of the OpenSolaris box, because I have fought with this before: 2009.06 had iSCSI/VMware compatibility problems, so I did an image-update to the latest dev build at the time (I forget which one, this was possibly a few months ago), and after that the iSCSI write speed was blazing fast.
The target (a whitebox) has 6 x 1.5 TB SATA disks in RAIDZ and 2.4 GB of RAM allocated to it, running OpenSolaris snv_134 as an ESXi 4.0.1 guest. The initiator (a proper HP server) is ESXi 4.0.0. CIFS performance is as expected, but iSCSI writes are very slow. Or, to be more precise, they are not so much slow as "bursty", and a lot of the time the initiator loses the connection due to a timeout. Sometimes there are no problems, though. I run a script that backs up all VMs to the iSCSI LUN, and on the initial test run I had very bad results: no VM was transferred successfully, because every VM clone (sizes ranging from 8 GB to 50 GB) timed out and the initiator lost the connection to the target. Eventually, on one run, only 4 out of 14 VMs failed due to timeouts, the total size being around 400 GB. I have not ruled out a problem on the initiator side at this stage; maybe ESXi has a problem... At the moment the same 3 VMs being copied (cloned) time out after multiple attempts (~10!).

At first I tried with dedup turned on, but the initiator constantly lost the connection to the target due to timeouts and the setup was completely unusable - maybe the CPU on the target (a Core 2 Duo E8400) could not keep up with it. With dedup turned off for this volume, I can at least get writes somewhat working.

On a related note, I tried increasing the CPU count for the OpenSolaris guest in ESXi from 1 to 2, but for some reason that totally bogged down the system: while testing this, the whole system hung and was extremely laggy at times, with both cores at 100%.

I also tried a cross-over cable to rule out a problem with the switch. MTU is 1500, and the network settings are the same as in the test a few months ago when I got the speed I was expecting. I am very new to OpenSolaris but have plenty of experience with other Unix systems, so any debugging pointers are appreciated.

I have done some testing in the hope that someone can make sense of this; below is some output from zpool iostat and iostat:

@fs1:~$ zpool iostat pool 2
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
--- this is slightly after the writes start, numbers were similar to this before ---
pool         389G  6.43T      4  17.8K  9.61K  41.6M
pool         390G  6.43T     22  14.8K  51.0K  84.2M
pool         390G  6.43T     76  1.30K   187K  3.05M
--- at least at this point the write is still working, but very "bursty" as you can see ---
pool         389G  6.43T      0  8.82K      0  69.1M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.77K      0  68.1M
pool         389G  6.43T      0     41      0  76.5K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  6.48K      0  51.7M
pool         389G  6.43T      0  1.98K      0  15.0M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.46K      0  66.5M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.38K      0  66.3M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  7.90K      0  62.8M
pool         389G  6.43T      0    114      0   310K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.37K      0  66.2M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.37K      0  66.3M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  6.74K      0  53.6M
pool         389G  6.43T      0     43      0  83.2K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.42K      0  66.9M
pool         389G  6.43T      0     35      0  71.9K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  7.34K      0  58.4M
pool         389G  6.43T      0     44      0  84.6K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.15K      0  64.8M
pool         389G  6.43T      0     43      0  81.7K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0    537      0  4.15M
pool         389G  6.43T      0  7.81K      0  61.9M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.36K      0  66.2M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.39K      0  66.5M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0  4.00K      0
pool         389G  6.43T      0  7.10K      0  56.5M
pool         389G  6.43T      0    820      0  6.09M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  9.06K      0  71.8M
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  7.35K      0  58.5M
pool         389G  6.43T      0     43      0  84.7K
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0      0      0      0
pool         389G  6.43T      0  8.46K      0  67.1M
--- at this point "something" happens and only reads occur until the initiator eventually times out. The ESXi initiator does not /always/ lose the connection to the target; it is just that the VM cloning reports "timed out" ---
pool         389G  6.43T      2      0  15.0K      0
pool         389G  6.43T     13      0  81.0K      0
pool         389G  6.43T     15      0  96.0K      0
pool         389G  6.43T     24  8.41K  98.5K  66.9M
pool         389G  6.43T    103      0   296K      0
pool         389G  6.43T     98      0   311K      0
pool         389G  6.43T    109      0   327K      0
pool         389G  6.43T    106      0   280K      0
pool         389G  6.43T     94      0   220K      0
pool         389G  6.43T     85      0   203K      0
pool         389G  6.43T     91      0   219K      0
pool         389G  6.43T     88      0   214K      0
pool         389G  6.43T     85      0   201K      0
pool         389G  6.43T     84      0   201K      0
pool         389G  6.43T     91      0   214K      0
pool         389G  6.43T     82      0   193K      0
pool         389G  6.43T     82      0   196K      0
pool         389G  6.43T     86      0   209K      0
pool         389G  6.43T     93      0   215K      0
pool         389G  6.43T    111      0   252K      0
pool         389G  6.43T    137      0   304K      0
pool         389G  6.43T     98      0   225K      0
pool         389G  6.43T    102      0   238K      0
pool         389G  6.43T     85      0   204K      0
pool         389G  6.43T     85      0   203K      0
pool         389G  6.43T     77      0   182K      0
pool         389G  6.43T    100      0   229K      0
pool         389G  6.43T     92      0   215K      0
pool         389G  6.43T     89      0   209K      0
pool         389G  6.43T     91      0   219K      0
pool         389G  6.43T     86      0   204K      0
pool         389G  6.43T     87      0   209K      0
pool         389G  6.43T     91      0   209K      0
pool         389G  6.43T    112      0   256K      0
pool         389G  6.43T     89      0   212K      0
pool         389G  6.43T     98      0   226K      0
pool         389G  6.43T    108      0   246K      0
pool         389G  6.43T     90      0   213K      0
pool         389G  6.43T    150      0   341K      0
pool         389G  6.43T     98      0   227K      0
pool         389G  6.43T     92      0   215K      0
pool         389G  6.43T     89      0   211K      0
pool         389G  6.43T     88      0   211K      0
pool         389G  6.43T    105      0   240K      0
pool         389G  6.43T    128      0   292K      0
pool         389G  6.43T     82      0   198K      0
pool         389G  6.43T     86      0   204K      0
pool         389G  6.43T     96      0   224K      0
pool         389G  6.43T     91      0   214K      0
pool         389G  6.43T    102      0   235K      0

From a separate attempt, here is the iostat output:

@fs1:~$ iostat 2 1000
 tty        sd0           sd1           sd2            sd3            cpu
tin tout  kps tps serv  kps tps serv  kps tps serv   kps tps serv   us sy wt id
--- everything goes fine for a while ---
0 2  34 1 11  0 0 0  1699 36 6  1698 36 6  7 12 0 81
0 118  0 0 0  0 0 0  18589 172 5  18589 172 5  8 86 0 6
0 41  0 0 0  0 0 0  10409 224 10  9897 202 11  10 86 0 4
0 42  1 1 8  0 0 0  12153 214 11  12619 236 13  11 83 0 6
0 42  0 0 0  0 0 0  18153 231 5  18090 231 5  9 85 0 6
0 42  0 0 0  0 0 0  17643 213 5  17963 215 5  7 88 0 5
0 42  0 0 0  0 0 0  3374 56 4  1799 42 7  13 77 0 10
0 40  0 0 0  0 0 0  15884 193 3  17128 203 2  10 85 0 5
0 44  0 0 0  0 0 0  18951 181 3  18952 184 3  9 86 0 5
^C

@fs1:~$ iostat 5 1000
 tty        sd0           sd1           sd2            sd3            cpu
tin tout  kps tps serv  kps tps serv  kps tps serv   kps tps serv   us sy wt id
0 2  34 1 11  0 0 0  1702 36 6  1702 36 6  7 12 0 81
0 41  39 10 2  0 0 0  16486 160 6  16155 157 5  10 84 0 6
0 20  0 0 1  0 0 0  9062 121 5  9017 120 4  12 53 0 35
--- at this point it seems the write has stalled ---
0 16  0 0 0  0 0 0  87 89 13  78 84 11  12 3 0 85
0 16  0 0 0  0 0 0  65 81 9  55 81 9  12 4 0 84
0 16  0 0 0  0 0 0  1598 136 15  1576 138 12  12 26 0 62
0 16  0 0 0  0 0 0  7549 138 7  7577 133 6  11 32 0 57
0 16  0 0 0  0 0 0  46 78 7  47 82 7  12 4 0 84
0 16  10 0 14  0 0 0  44 78 7  44 78 7  11 4 0 84
0 16  0 0 0  0 0 0  45 80 7  45 80 6  11 3 0 86
0 16  0 0 0  0 0 0  43 75 7  41 74 7  10 4 0 86
0 16  0 0 0  0 0 0  47 80 7  45 80 7  10 4 0 86
0 16  0 0 0  0 0 0  44 80 7  45 82 7  9 4 0 87

I have also tried to rule out the vmkfstools clone operation itself by writing directly to a file on the iSCSI LUN:

# time dd if=/dev/zero of=/vmfs/volumes/fs1-vmbackup/test bs=65536 count=10000

Same thing: in the beginning the speed is good, and eventually it degrades to a few bursts with 12-15 seconds between them. I believe the gap between bursts is the initiator timing out on the connection and then re-establishing it.

For those who understand ESXi:

May 7 06:14:04 vmkernel: 1:00:17:17.510 cpu7:4179)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005074f40) to NMP device "naa.600144f065f30c0000004be283ba0003" failed on physical path "vmhba34:C0:T3:L1" H:0x5 D:0x40 P:0x0 Possible sense data:
May 7 06:14:05 0x2 0x3a 0x1.
May 7 06:14:04 vmkernel: 1:00:17:17.510 cpu7:4179)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600144f065f30c0000004be283ba0003" state in doubt; requested fast path state update...
May 7 06:14:04 vmkernel: 1:00:17:17.510 cpu7:4179)ScsiDeviceIO: 747: Command 0x2a to device "naa.600144f065f30c0000004be283ba0003" failed H:0x5 D:0x40 P:0x0 Possible sense data: 0x2 0x3a 0x1.

--
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
storage-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/storage-discuss