Could you share your patches to vzmigrate and vzctl?

On Thu, Jul 10, 2014 at 2:25 PM, Pavel Odintsov <pavel.odint...@gmail.com> wrote:
> Thank you for your answers! It's really useful information.
>
> On Thu, Jul 10, 2014 at 2:08 PM, Pavel Snajdr <li...@snajpa.net> wrote:
>> On 07/10/2014 11:35 AM, Pavel Odintsov wrote:
>>>> Not true, IO limits are working as they should (if we're talking vzctl set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO accounting support, so it is there.
>>>
>>> Can you share tests with us? For standard folders, as with simfs, these limits work badly in a large number of cases.
>>
>> If you can give me concrete tests to run, sure, I'm curious to see if you're right - then we'd have something concrete to fix :)
>>
>>>> How? ZFS doesn't have a limit on the number of files (2^48 isn't really a limit)
>>>
>>> Is it OK when your customer creates 1 billion small files on a 10GB VPS and you then try to archive it for backup? On a slow disk system it's a real nightmare, because the huge number of disk operations kills your I/O.
>>
>> zfs snapshot <dataset>@<snapname>
>> zfs send <dataset>@<snapname> > your-file    or    | ssh backuper zfs recv <backupdataset>
>>
>> That's done on the block level. No need to run rsync anymore, it's a lot faster this way.
>>
>>>> Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS. I thought the point of migration is that the CT doesn't notice any change; I don't see why the inode numbers should change.
>>>
>>> Do you have a really working zero-downtime vzmigrate on ZFS?
>>
>> Nope, vzmigrate isn't zero downtime. Since vzctl/vzmigrate don't support ZFS, we're implementing this our own way in vpsAdmin, which in its 2.0 re-implementation will go open source under the GPL.
>>
>>>> How exactly? I haven't seen a problem with any userspace software, other than MySQL defaulting to AIO (it falls back to an older method), which ZFS doesn't support (*yet* - they have it in their plans).
>>>
>>> I'm speaking about MySQL primarily. I have thousands of containers and I can't tune MySQL to another mode for all customers; it's impossible.
>>
>> As I said, this is under development and will improve.
>>
>>>> The L2ARC cache is really smart
>>>
>>> Yep, fine, I know. But can you account for L2ARC cache usage per customer? OpenVZ can, via a flag:
>>> sysctl -a | grep pagecache_isola
>>> ubc.pagecache_isolation = 0
>>
>> I can't account for caches per CT, but I haven't had any need to do so.
>>
>> L2ARC != ARC. ARC is in system RAM; L2ARC is intended to live on SSD and holds the content of ARC that is least significant in a low-memory situation - it gets pushed from ARC to L2ARC.
>>
>> ARC has two primary lists of cached data - most frequently used and most recently used - and these two lists are divided by a boundary marking which data can be pushed away in a low-memory situation.
>>
>> Unlike the Linux VFS cache, copying one big file doesn't push out all of the other useful data.
>>
>> Thanks to this distinction between MRU and MFU, ARC achieves far better hitrates.
>>
>>> But one customer can eat almost all of the L2ARC cache and displace other customers' data.
>>
>> Yes, but ZFS keeps track of what's being used, so useful data can't be pushed away that easily; things naturally balance themselves due to the way the ARC mechanism works.
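
(For illustration, a minimal sketch of how the ARC/L2ARC behaviour described above can be inspected on a ZoL node; the counter selection is an assumption and may vary between ZoL releases, and these numbers are global, not per-CT:)

    # ARC and L2ARC counters are exported via the SPL kstat interface:
    grep -wE '^(size|c|hits|misses|mru_hits|mfu_hits|l2_hits|l2_misses)' \
        /proc/spl/kstat/zfs/arcstats

    # If the arcstat script shipped with ZoL is installed, hitrates can be
    # watched live, e.g. refreshed once per second:
    arcstat.py 1
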
>>
>>> I'm not against ZFS, but I am against using ZFS as the underlying system for containers. We caught ~100 kernel bugs with simfs on ext4 when customers did strange things.
>>
>> I haven't encountered any problems, especially with vzquota disabled (no need for it - ZFS has its own quotas, which never need to be recalculated the way vzquota does).
>>
>>> But ext4 has a few thousand developers and they fix these issues ASAP, while ZFS on Linux has only 3-5 developers, which is VERY slow. Because of this I recommend using ext4 with ploop, because that solution is rock stable, or ZFS with ZVOLs carrying ext4, because that is more reliable and more predictable than placing containers directly on ZFS filesystems.
>>
>> ZFS itself is a stable and mature filesystem; it first shipped as production with Solaris in 2006. And it's still being developed upstream as OpenZFS - that code is shared between the primary version, Illumos, and the ports: FreeBSD, OS X, Linux.
>>
>> So what still is being developed is the way ZFS runs under the Linux kernel, but with the recent release of 0.6.3, things have gotten mature enough to be used in production without any fears. Of course, no software is without bugs, but I can say with absolute certainty that ZFS will never eat your data; the only problem you can encounter is with memory management, which is done really differently in Linux than in ZFS's original habitat, Solaris.
>>
>> /snajpa
>>
>>> On Thu, Jul 10, 2014 at 1:08 PM, Pavel Snajdr <li...@snajpa.net> wrote:
>>>> On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
>>>>> Hello!
>>>>>
>>>>> Your scheme is fine, but you can't divide I/O load with the cgroup blkio controller (ioprio/iolimit/iopslimit) between different folders, whereas between different ZVOLs you can.
>>>>
>>>> Not true, IO limits are working as they should (if we're talking vzctl set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO accounting support, so it is there.
>>>>
>>>>> I could imagine the following problems for a per-folder scheme:
>>>>> 1) You can't limit the number of inodes in different folders (ZFS has no inode limit the way ext4 does, but a huge number of files in a container could break the node;
>>>>
>>>> How? ZFS doesn't have a limit on the number of files (2^48 isn't really a limit)
>>>>
>>>>> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
>>>>> 2) Problems with the system cache, which is shared by all containers on the HWN
>>>>
>>>> This exactly isn't a problem, but a *HUGE* benefit - you'd need to see it in practice :) The Linux VFS cache is really dumb in comparison to ARC. ARC's hitrates just can't be achieved with what Linux currently offers.
>>>>
>>>>> 3) Problems with live migration, because you _would have to_ change inode numbers on different nodes
>>>>
>>>> Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS. I thought the point of migration is that the CT doesn't notice any change; I don't see why the inode numbers should change.
>>>>
>>>>> 4) ZFS behaviour with Linux software is in some cases very STRANGE (DIRECT_IO)
>>>>
>>>> How exactly? I haven't seen a problem with any userspace software, other than MySQL defaulting to AIO (it falls back to an older method), which ZFS doesn't support (*yet* - they have it in their plans).
>>>>
>>>>> 5) ext4 has good support from vzctl (fsck, resize2fs)
>>>>
>>>> Yeah, but ext4 sucks big time. At least in my use case.
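
(For illustration, a minimal sketch of how the fsck/resize2fs tasks from point 5 map onto per-dataset containers; the dataset name vz/private/101 mirrors the example later in the thread and the sizes are made-up values:)

    # "resize2fs": growing or shrinking a CT's space is a property change, no downtime.
    zfs set refquota=60G vz/private/101    # limit the CT's own data
    zfs set quota=80G vz/private/101       # limit including snapshots and descendants
    zfs get used,refquota,quota vz/private/101

    # "fsck": there is no offline fsck; an online scrub verifies every checksum instead.
    zpool scrub vz
    zpool status vz                        # shows scrub progress and results
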
>>>> We've implemented most of the vzctl create/destroy/etc. functionality in our vpsAdmin software instead.
>>>>
>>>> Guys, can I ask you to keep an open mind instead of fighting with pointless arguments? :) Give ZFS a try and then decide for yourselves.
>>>>
>>>> I think the community would benefit greatly if ZFS weren't fought as something alien in the Linux world, which in my experience is what every Linux zealot I talk to about ZFS is doing. This is just not fair. It's primarily about technology, about the best tool for the job. If we could implement something like this in Linux without ties to the CDDL and possibly Oracle patents, that would be awesome, yet nobody has done such a thing yet. BTRFS is nowhere near ZFS when it comes to running larger-scale deployments, and in some regards I don't think it will ever match ZFS, just looking at the way it's been designed.
>>>>
>>>> I'm not trying to flame here, I'm trying to open you guys up to the fact that there really is a better alternative than you're currently seeing. And if it has some technological drawbacks like the ones you're trying to point out, instead of treating them as something which can't be changed - and therefore everyone should use "your best solution(tm)" - try to think of ways to change it for the better.
>>>>
>>>>> My ideas, like the simfs vs ploop comparison:
>>>>> http://openvz.org/images/f/f3/Ct_in_a_file.pdf
>>>>
>>>> Again, you have to see ZFS doing its magic in production under a really heavy load, otherwise you won't understand. Arbitrary benchmarks I've seen show ZFS to be slower than ext4, but they aren't tuned for the use cases I'm talking about.
>>>>
>>>> /snajpa
>>>>
>>>>> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <li...@snajpa.net> wrote:
>>>>>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>>>>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>>>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> (offtopic) We cannot use ZFS. Unfortunately, a NAS with something like Nexenta is too expensive for us.
>>>>>>>>>
>>>>>>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable but more performant than it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>>>>>>
>>>>>>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>>>>>>
>>>>>>>>> TYL,
>>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> we run tens of OpenVZ nodes (bigger boxes: 256G RAM, 12+ cores, 90 CTs at least). We used to run ext4+flashcache, but ext4 proved to be a bottleneck. That was the primary motivation behind ploop as far as I know.
>>>>>>>>
>>>>>>>> We switched to ZFS on Linux around the time ploop was announced and I haven't had second thoughts since. ZFS really *is*, in my experience, the best filesystem there is at the moment for this kind of deployment - especially if you use dedicated SSDs for ZIL and L2ARC, although the latter is less important. You will know what I'm talking about when you try this on boxes with lots of CTs doing LAMP load - databases and their synchronous writes are the real problem, which ZFS with a dedicated ZIL device solves.
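
(For illustration, a minimal sketch of attaching dedicated SSD partitions as a mirrored log device and as L2ARC cache, as described above; the pool and device names are assumptions that happen to match the production layout shown below:)

    # Mirrored SLOG: synchronous writes (databases etc.) are committed here first.
    zpool add vz log mirror sdc3 sdd3

    # L2ARC cache devices: spill-over for the in-RAM ARC, no redundancy required.
    zpool add vz cache sdc5 sdd5
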
>>>>>>>>
>>>>>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache - we're able to achieve about a 99% hitrate about 99% of the time, even under high loads.
>>>>>>>>
>>>>>>>> Having said all that, I recommend everyone give ZFS a chance, but I'm aware this is yet another piece of out-of-mainline code and that doesn't suit everyone that well.
>>>>>>>
>>>>>>> Are you using per-container ZVOLs or something else?
>>>>>>
>>>>>> That would mean I'd need to put another filesystem on top of ZFS, which would in turn add another unnecessary layer of indirection. ZFS is pooled storage like BTRFS is; we're giving one dataset to each container.
>>>>>>
>>>>>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one more directory to put the VE_PRIVATE data into (see the first ls).
>>>>>>
>>>>>> Example from production:
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # zpool status vz
>>>>>>   pool: vz
>>>>>>  state: ONLINE
>>>>>>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul  8 16:22:17 2014
>>>>>> config:
>>>>>>
>>>>>>         NAME        STATE     READ WRITE CKSUM
>>>>>>         vz          ONLINE       0     0     0
>>>>>>           mirror-0  ONLINE       0     0     0
>>>>>>             sda     ONLINE       0     0     0
>>>>>>             sdb     ONLINE       0     0     0
>>>>>>           mirror-1  ONLINE       0     0     0
>>>>>>             sde     ONLINE       0     0     0
>>>>>>             sdf     ONLINE       0     0     0
>>>>>>           mirror-2  ONLINE       0     0     0
>>>>>>             sdg     ONLINE       0     0     0
>>>>>>             sdh     ONLINE       0     0     0
>>>>>>         logs
>>>>>>           mirror-3  ONLINE       0     0     0
>>>>>>             sdc3    ONLINE       0     0     0
>>>>>>             sdd3    ONLINE       0     0     0
>>>>>>         cache
>>>>>>           sdc5      ONLINE       0     0     0
>>>>>>           sdd5      ONLINE       0     0     0
>>>>>>
>>>>>> errors: No known data errors
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # zfs list
>>>>>> NAME             USED  AVAIL  REFER  MOUNTPOINT
>>>>>> vz               432G  2.25T    36K  /vz
>>>>>> vz/private       427G  2.25T   111K  /vz/private
>>>>>> vz/private/101  17.7G  42.3G  17.7G  /vz/private/101
>>>>>> <snip>
>>>>>> vz/root          104K  2.25T   104K  /vz/root
>>>>>> vz/template     5.38G  2.25T  5.38G  /vz/template
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # zfs get compressratio vz/private/101
>>>>>> NAME            PROPERTY       VALUE  SOURCE
>>>>>> vz/private/101  compressratio  1.38x  -
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # ls /vz/private/101
>>>>>> private
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # ls /vz/private/101/private/
>>>>>> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
>>>>>> <snip>
>>>>>>
>>>>>> [r...@node2.prg.vpsfree.cz]
>>>>>>  ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>>>>>> VE_ROOT="/vz/root/101"
>>>>>> VE_PRIVATE="/vz/private/101/private"
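
(For illustration, a rough sketch of how a new container's dataset could be laid out to match the example above; CTID 102, the refquota and the compression setting are made-up values, and since vzctl has no native ZFS support the dataset is prepared by hand or by external tooling such as vpsAdmin:)

    zfs create -o refquota=60G -o compression=on vz/private/102
    mkdir -p /vz/private/102/private /vz/root/102

    # The CT config then points at the extra "private" directory, mirroring 101.conf:
    #   VE_ROOT="/vz/root/102"
    #   VE_PRIVATE="/vz/private/102/private"
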
>
> --
> Sincerely yours, Pavel Odintsov

--
Sincerely yours, Pavel Odintsov
_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users