On Wed, 2015-04-22 at 11:54 +0200, Maikel vd Mosselaar wrote:
> Yes, we are aware of that. The problem is that it's running production,
> so it is not very easy to change the pool.
>
> On 04/22/2015 11:48 AM, InterNetX - Juergen Gotteswinter wrote:
> > I expect that you are aware of the fact that you only get the write
> > performance of a single disk in that configuration? I would drop that
> > pool configuration, drop the spare drives and go for a mirror pool.

^ What he said :) That, or if you have the space to add another 2 disks,
use them plus the spare drives to add a second raidz(1|2|3) vdev. What
drives do you use for data, log and cache?
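Roughly like this, as a sketch only -- the spare device names are taken
from your status output below, the two new disks are placeholders, and on
a production pool you should double-check every device name first:

  # free up the four hot spares
  zpool remove z2pool c0t5000C50041A6B737d0 c0t5000C50041AC3F07d0 \
      c0t5000C50041AD48DBd0 c0t5000C50041ADD727d0
  # add a second 6-disk raidz1 vdev built from them plus two new disks
  zpool add z2pool raidz1 c0t5000C50041A6B737d0 c0t5000C50041AC3F07d0 \
      c0t5000C50041AD48DBd0 c0t5000C50041ADD727d0 <new-disk-1> <new-disk-2>

Note that "zpool add" is permanent -- a top-level vdev cannot be removed
from the pool afterwards -- which is why the mirror layout Juergen
suggests is worth considering before committing.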
/K

> > On 22.04.2015 at 11:39, Maikel vd Mosselaar wrote:
> >>   pool: z2pool
> >>  state: ONLINE
> >>   scan: scrub canceled on Sun Apr 12 16:33:38 2015
> >> config:
> >>
> >>         NAME                       STATE     READ WRITE CKSUM
> >>         z2pool                     ONLINE       0     0     0
> >>           raidz1-0                 ONLINE       0     0     0
> >>             c0t5000C5004172A87Bd0  ONLINE       0     0     0
> >>             c0t5000C50041A59027d0  ONLINE       0     0     0
> >>             c0t5000C50041A592AFd0  ONLINE       0     0     0
> >>             c0t5000C50041A660D7d0  ONLINE       0     0     0
> >>             c0t5000C50041A69223d0  ONLINE       0     0     0
> >>             c0t5000C50041A6ADF3d0  ONLINE       0     0     0
> >>         logs
> >>           c0t5001517BB2845595d0    ONLINE       0     0     0
> >>         cache
> >>           c0t5001517BB2847892d0    ONLINE       0     0     0
> >>         spares
> >>           c0t5000C50041A6B737d0    AVAIL
> >>           c0t5000C50041AC3F07d0    AVAIL
> >>           c0t5000C50041AD48DBd0    AVAIL
> >>           c0t5000C50041ADD727d0    AVAIL
> >>
> >> errors: No known data errors
> >>
> >>
> >> On 04/22/2015 11:17 AM, Karli Sjöberg wrote:
> >>> On Wed, 2015-04-22 at 11:12 +0200, Maikel vd Mosselaar wrote:
> >>>> Our pool is configured as Z1 with a ZIL (normal SSD); the sync
> >>>> parameter is on the default setting (standard), so "sync" is on.
> >>> # zpool status ?
> >>>
> >>> /K
> >>>
> >>>> When the issue happens, the oVirt event viewer does indeed show
> >>>> latency warnings. Not always, but most of the time this is followed
> >>>> by an I/O storage error linked to random VMs, and they are paused
> >>>> when that happens.
> >>>>
> >>>> All the nodes use mode 4 bonding. The interfaces on the nodes don't
> >>>> show any drops or errors. I checked 2 of the VMs that got paused the
> >>>> last time it happened; they have dropped packets on their interfaces.
> >>>>
> >>>> We don't have a subscription with Nexenta (anymore).
> >>>>
> >>>> On 04/21/2015 04:41 PM, InterNetX - Juergen Gotteswinter wrote:
> >>>>> On 21.04.2015 at 16:19, Maikel vd Mosselaar wrote:
> >>>>>> Hi Juergen,
> >>>>>>
> >>>>>> The load on the nodes rises to far over 200 during the event. Load
> >>>>>> on the Nexenta stays normal and there is nothing strange in the
> >>>>>> logging.
> >>>>> ZFS + NFS could still be the root of this. Is your pool
> >>>>> configuration raidzX or mirror, with or without a ZIL? And is the
> >>>>> sync parameter of the exported ZFS subvolume kept at the default,
> >>>>> "standard"?
> >>>>>
> >>>>> http://christopher-technicalmusings.blogspot.de/2010/09/zfs-and-nfs-performance-with-zil.html
> >>>>>
> >>>>> Since oVirt reacts very sensitively to storage latency (it throws
> >>>>> VMs into an unresponsive or unknown state), it might be worth a try
> >>>>> to do "zfs set sync=disabled pool/volume" to see if this changes
> >>>>> things. But be aware that this makes the NFS export vulnerable to
> >>>>> data loss in case of power loss etc., comparable to async NFS on
> >>>>> Linux.
> >>>>>
> >>>>> If disabling the sync setting helps, and you don't use a separate
> >>>>> ZIL flash drive yet -> adding one would very likely help to get rid
> >>>>> of this.
> >>>>>
> >>>>> Also, if you run a subscribed version of Nexenta, it might be
> >>>>> helpful to involve them.
> >>>>>
> >>>>> Do you see any messages about high latency in the oVirt Events
> >>>>> panel?
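On Juergen's sync test above, for the archives: a minimal, reversible way
to try it (the dataset name is a placeholder for whichever subvolume is
NFS-exported):

  # check the current setting
  zfs get sync z2pool/<dataset>
  # disable sync writes for the test; unsafe on power loss while active!
  zfs set sync=disabled z2pool/<dataset>
  # put it back once you have your answer
  zfs set sync=standard z2pool/<dataset>

If the latency warnings stop while sync is disabled, the ZIL path (and
how well that log SSD handles synchronous writes) is the place to look.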
> >>>>>> For our storage interfaces on our nodes we use bonding in mode 4
> >>>>>> (802.3ad), 2x 1Gb. The Nexenta has a 4x 1Gb bond in mode 4 as well.
> >>>>> This should be fine, as long as no node uses mode 0 / round-robin,
> >>>>> which would lead to out-of-order TCP packets. The interfaces
> >>>>> themselves don't show any drops or errors - on the VM hosts as well
> >>>>> as on the switch itself?
> >>>>>
> >>>>> Jumbo frames?
> >>>>>
> >>>>>> Kind regards,
> >>>>>>
> >>>>>> Maikel
> >>>>>>
> >>>>>>
> >>>>>> On 04/21/2015 02:51 PM, InterNetX - Juergen Gotteswinter wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> how about load, latency, or strange dmesg messages on the Nexenta?
> >>>>>>> Are you using bonded Gbit networking? If yes, which mode?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Juergen
> >>>>>>>
> >>>>>>> On 20.04.2015 at 14:25, Maikel vd Mosselaar wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We are running oVirt 3.5.1 with 3 nodes and a separate engine.
> >>>>>>>>
> >>>>>>>> All on CentOS 6.6:
> >>>>>>>> 3 x nodes
> >>>>>>>> 1 x engine
> >>>>>>>>
> >>>>>>>> 1 x storage: Nexenta with NFS
> >>>>>>>>
> >>>>>>>> For multiple weeks we have been experiencing issues where our
> >>>>>>>> nodes cannot access the storage at random moments (at least,
> >>>>>>>> that's what the nodes think).
> >>>>>>>>
> >>>>>>>> When the nodes complain about unavailable storage, the load rises
> >>>>>>>> up to 200+ on all three nodes, which makes all running VMs
> >>>>>>>> inaccessible. During this process the oVirt event viewer shows
> >>>>>>>> some I/O storage error messages; when this happens, random VMs
> >>>>>>>> get paused and will not be resumed anymore (this happens almost
> >>>>>>>> every time, but not all the VMs get paused).
> >>>>>>>>
> >>>>>>>> During the event we tested the accessibility from the nodes to
> >>>>>>>> the storage and it looks like it is working normally; at least we
> >>>>>>>> can do a normal "ls" on the storage without any delay in showing
> >>>>>>>> the contents.
> >>>>>>>>
> >>>>>>>> We have tried multiple things that we thought might cause this
> >>>>>>>> issue, but nothing has worked so far:
> >>>>>>>> * rebooting storage / nodes / engine.
> >>>>>>>> * disabling offsite rsync backups.
> >>>>>>>> * moving the biggest VMs with the highest load to a different
> >>>>>>>> platform outside of oVirt.
> >>>>>>>> * checking the wsize and rsize on the NFS mounts; storage and
> >>>>>>>> nodes are correct according to the "NFS troubleshooting page" on
> >>>>>>>> ovirt.org.
> >>>>>>>>
> >>>>>>>> The environment is running in production, so we are not free to
> >>>>>>>> test everything.
> >>>>>>>>
> >>>>>>>> I can provide log files if needed.
> >>>>>>>>
> >>>>>>>> Kind Regards,
> >>>>>>>>
> >>>>>>>> Maikel
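On the wsize/rsize point in Maikel's first mail above: the options each
mount actually negotiated, and the bond state on the nodes, can be
verified quickly on the CentOS hosts. A sketch, with bond0 as an assumed
interface name:

  # options the NFS client really negotiated, including rsize/wsize
  nfsstat -m
  # fallback raw view of the same information
  grep nfs /proc/mounts
  # confirm 802.3ad is negotiated and check the slaves for link flaps
  cat /proc/net/bonding/bond0

Comparing these during and outside an event might show whether the stall
is in the NFS layer or below it.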
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users