Re: [pve-devel] Two-Node HA
FYI, there was support for using a quorum disk in 3.x: https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster But the corosync developers decided to drop support for that. AFAIK this technology is a leftover from the old days, and nobody was happy with that complex and error-prone software. Because of that, they decided to implement a better way to provide quorum (qdevice, qnetd) ...
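For readers who want to see what the qdevice/qnetd approach mentioned above looks like in practice, here is a rough sketch of the corosync side. This is only an illustration assuming the corosync-qnetd package on an external tie-breaker host and corosync-qdevice on the two cluster nodes; the host address is an example value, and the Proxmox VE integration was not yet available at the time of this thread.

  # on the external tie-breaker host (e.g. the storage box or a small VM):
  apt-get install corosync-qnetd       # runs the qnetd vote server

  # on each of the two cluster nodes:
  apt-get install corosync-qdevice

  # quorum section of corosync.conf on the nodes (sketch):
  quorum {
    provider: corosync_votequorum
    device {
      model: net
      votes: 1
      net {
        host: 192.0.2.10       # address of the qnetd host (example value)
        algorithm: ffsplit     # handles the 50/50 split of a two-node cluster
      }
    }
  }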
Re: [pve-devel] pve-manager and disk IO monitoring
> So my question is: Why use scsi-generic instead of scsi-block when
> scsi-generic prevents blockstats?

commit d454d040338a6216c8d3e5cc9623d6223476cb5a
Author: Alexandre Derumier
Date:   Tue Aug 28 12:46:07 2012 +0200

    use scsi-generic by default with libiscsi

    This add scsi passthrough with libiscsi

    Signed-off-by: Alexandre Derumier

@Alexandre: This was for performance reasons?
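To make the distinction the thread keeps coming back to concrete, here is a hedged sketch of the QEMU command-line difference. The device names (scsi-hd, scsi-block, scsi-generic) are real QEMU devices; the iSCSI path, IDs and the omitted SCSI controller are made-up example values.

  # full emulation (QEMU sees every request, blockstats work):
  -drive file=iscsi://192.0.2.20/iqn.2016-09.example:target/0,if=none,id=drive-scsi0 \
  -device scsi-hd,drive=drive-scsi0

  # pass-through to the LUN through QEMU's block layer (blockstats still work):
  -device scsi-block,drive=drive-scsi0

  # raw SCSI command pass-through via the generic SCSI layer; QEMU only
  # forwards the commands, so it cannot account the I/O -> no blockstats:
  -device scsi-generic,drive=drive-scsi0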
Re: [pve-devel] pve-manager and disk IO monitoring
"Running a fio test also only shows marginal performance difference between scsi-block and scsi-generic" I think that 11% difference is not so marginal. I'm curious to see difference with full flash array, if we have the same cpu iothread bottleneck like ceph, with scsi-block vs scsi-generic. Maybe can we add an option to choose between scsi-block && scsi-generic - Mail original - De: "datanom.net"À: "pve-devel" Envoyé: Vendredi 30 Septembre 2016 01:23:20 Objet: Re: [pve-devel] pve-manager and disk IO monitoring On Fri, 30 Sep 2016 00:51:06 +0200 Michael Rasmussen wrote: > > So my question is: Why use scsi-generic instead of scsi-block when > scsi-generic prevents blockstats? > Running a fio test also only shows marginal performance difference between scsi-block and scsi-generic -device scsi-block iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64 fio-2.1.11 Starting 1 process iometer: Laying out IO file(s) (1 file(s) / 3072MB) Jobs: 1 (f=1): [m(1)] [100.0% done] [73928KB/19507KB/0KB /s] [17.3K/4381/0 iops] [eta 00m:00s] iometer: (groupid=0, jobs=1): err= 0: pid=1568: Fri Sep 30 01:17:05 2016 Description : [Emulation of Intel IOmeter File Server Access Pattern] read : io=2454.9MB, bw=87501KB/s, iops=14328, runt= 28728msec slat (usec): min=2, max=4703, avg=10.47, stdev=16.99 clat (usec): min=315, max=1505.6K, avg=3479.55, stdev=8270.22 lat (usec): min=321, max=1505.6K, avg=3490.40, stdev=8270.14 clat percentiles (usec): | 1.00th=[ 1768], 5.00th=[ 2480], 10.00th=[ 2640], 20.00th=[ 2864], | 30.00th=[ 2960], 40.00th=[ 3056], 50.00th=[ 3088], 60.00th=[ 3152], | 70.00th=[ 3248], 80.00th=[ 3376], 90.00th=[ 3824], 95.00th=[ 4448], | 99.00th=[ 8768], 99.50th=[13120], 99.90th=[52992], 99.95th=[103936], | 99.99th=[536576] bw (KB /s): min= 7148, max=193016, per=100.00%, avg=87866.39, stdev=28395.12 write: io=631998KB, bw=21999KB/s, iops=3590, runt= 28728msec slat (usec): min=4, max=9301, avg=12.69, stdev=33.41 clat (usec): min=299, max=778312, avg=3871.08, stdev=7378.66 lat (usec): min=305, max=778320, avg=3884.17, stdev=7378.66 clat percentiles (msec): | 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3], | 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 4], 60.00th=[ 4], | 70.00th=[ 4], 80.00th=[ 4], 90.00th=[ 5], 95.00th=[ 7], | 99.00th=[ 13], 99.50th=[ 19], 99.90th=[ 55], 99.95th=[ 101], | 99.99th=[ 537] bw (KB /s): min= 1524, max=46713, per=100.00%, avg=22089.18, stdev=7184.64 lat (usec) : 500=0.01%, 750=0.03%, 1000=0.06% lat (msec) : 2=1.29%, 4=88.78%, 10=8.94%, 20=0.56%, 50=0.22% lat (msec) : 100=0.05%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=8.24%, sys=28.49%, ctx=451227, majf=0, minf=8 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=411627/w=103162/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: io=2454.9MB, aggrb=87501KB/s, minb=87501KB/s, maxb=87501KB/s, mint=28728msec, maxt=28728msec WRITE: io=631997KB, aggrb=21999KB/s, minb=21999KB/s, maxb=21999KB/s, mint=28728msec, maxt=28728msec Disk stats (read/write): sda: ios=407383/102110, merge=123/54, ticks=1413272/456272, in_queue=1869620, util=99.71% -device scsi-generic iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64 fio-2.1.11 Starting 1 process iometer: Laying out IO file(s) (1 file(s) / 
3072MB) Jobs: 1 (f=1): [m(1)] [100.0% done] [64339KB/16908KB/0KB /s] [15.2K/3816/0 iops] [eta 00m:00s] iometer: (groupid=0, jobs=1): err= 0: pid=701: Fri Sep 30 01:20:45 2016 Description : [Emulation of Intel IOmeter File Server Access Pattern] read : io=2454.9MB, bw=88384KB/s, iops=14473, runt= 28441msec slat (usec): min=5, max=5814, avg=10.86, stdev=21.71 clat (usec): min=459, max=885935, avg=3451.71, stdev=3297.21 lat (usec): min=526, max=885944, avg=3462.97, stdev=3297.14 clat percentiles (msec): | 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 4], | 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 4], 60.00th=[ 4], | 70.00th=[ 4], 80.00th=[ 4], 90.00th=[ 4], 95.00th=[ 5], | 99.00th=[ 8], 99.50th=[ 11], 99.90th=[ 23], 99.95th=[ 63], | 99.99th=[ 153] bw (KB /s): min=46295, max=139025, per=100.00%, avg=88833.25, stdev=22609.61 write: io=631998KB, bw=1KB/s, iops=3627, runt= 28441msec slat (usec): min=6, max=3864, avg=12.96, stdev=24.18 clat (usec): min=582, max=156777, avg=3801.87, stdev=3128.06 lat (usec): min=610, max=156789, avg=3815.24, stdev=3128.36 clat percentiles (msec): | 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 4], | 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 4], 60.00th=[ 4], | 70.00th=[ 4], 80.00th=[ 4], 90.00th=[ 5], 95.00th=[ 7], |
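For reference, the benchmark discussed above emulates the Intel IOmeter file-server pattern. A fio job along the following lines would reproduce that kind of workload; this is a reconstruction from the reported parameters (randrw, 512-64K blocks, libaio, iodepth 64, ~80/20 read/write, 3 GB file), not the exact job the poster used, and the filename is a placeholder.

  [iometer]
  ioengine=libaio
  iodepth=64
  direct=1
  rw=randrw
  rwmixread=80
  bsrange=512-64k
  size=3g
  filename=/mnt/test/iometer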
Re: [pve-devel] pve-manager and disk IO monitoring
On Thu, 29 Sep 2016 09:17:56 +0300
Dmitry Petuhov wrote:
> It's a side effect of SCSI pass-through, which is used by default for
> [libi]scsi volumes with the scsi VM disk interface. QEMU is just not aware
> of the VM's block IO in that case. Also, cache settings for those volumes
> are ineffective, because qemu just proxies raw scsi commands to the backing
> storage, so caching is impossible.
>
> Do you use PVE backups (vzdump)? Do they work for the machines without
> stats? I think they should also not work with pass-through.
>
What do you mean by pass-through? (No pass-through is happening here, since the storage resides on a SAN.)

And yes, vzdump works for these machines.

--
Hilsen/Regards
Michael Rasmussen
Re: [pve-devel] pve-manager and disk IO monitoring
On Thu, 29 Sep 2016 09:41:35 +0300
Dmitry Petuhov wrote:
> In QemuServer.pm (some code omitted):
>
>     if ($drive->{interface} eq 'scsi')
>         my $devicetype = 'hd';
>         if ($path =~ m/^iscsi\:\/\//) {
>             $devicetype = 'generic';
>         }
>         $device = "scsi-$devicetype ...
>
> So usually, if the drive interface is scsi, PVE uses the fully-emulated
> qemu device 'scsi-hd'. But for iscsi: volumes (iscsi direct and zfs over
> iscsi) it uses the 'scsi-generic' device, which just proxies scsi commands
> between the guest OS and your SAN's iscsi target.
>
I see. So currently, by using scsi-generic, you sort of disable all qemu block features like monitoring etc.?

--
Hilsen/Regards
Michael Rasmussen
Re: [pve-devel] student question about ha "restricted" option
> Maybe we could improve the documentation, and add some examples?

We have been working on that for several months. People can also send patches ;-)

BTW, do you know the HA simulator?
Re: [pve-devel] [PATCH] Improvement for pve-installation
This patch makes several changes, but it is not clear why exactly. There are some simple fixes, but also some bigger rewrites. Can you please split it into smaller patches, and add a reasonable commit message to each change?
Re: [pve-devel] pve-manager and disk IO monitoring
On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
Alexandre DERUMIER wrote:
> iostats are coming from qemu.
>
> what is the output of monitor "info blockstats" for the vm where you don't
> have stats ?
>
Two examples below:

# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi1: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=0

# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=0

--
Hilsen/Regards
Michael Rasmussen
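As a side note, the same counters should also be reachable without the HMP monitor via the QMP command query-blockstats. A rough sketch, assuming the standard qemu-server QMP socket path and VMID 104 as an example; the exact handshake may need to be adapted:

  echo '{"execute":"qmp_capabilities"} {"execute":"query-blockstats"}' \
      | socat - UNIX-CONNECT:/var/run/qemu-server/104.qmp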
Re: [pve-devel] Two-Node HA
Hi Andreas,

On 09/28/2016 05:53 PM, Andreas Steinel wrote:
> Hi Thomas,
>
> Thank you for your time and your answer.
>
> I wonder why e.g. an Oracle Real Application Cluster (RAC) works so well
> with 2 nodes in a HA setup. We deployed 50+ clusters in the last years and
> never had a split-brain-like situation. Rolling updates, as well as
> occasional host crashes, are also possible without losing data - sometimes
> even sessions. If you use Transparent Failover (TAF), your database
> sessions will be migrated to the other node, rolled back and restarted
> (of course, application support is required on the "client" side). It's
> not perfect, but it works most of the time. We had only a few total
> crashes, mainly due to storage issues, but also due to some bugs in the
> cluster stack.

A bit of a lengthy explanation below why this comparison may not work, IMHO.

Oracle RAC and Proxmox VE do different things: one is an application with
quasi fail-silent characteristics running at the application level, the
other is an operating system running on bare metal, where byzantine errors
are possible.

With RAC you serve clients; if a client cannot reach you it asks another
server, and if you are dead you sync up when starting again. You are a
closed system which knows what runs inside and how the other server reacts
if something happens. But if the communication between the cluster nodes is
broken - while the clients can still reach them - and two clients write to
the same dataset, each on a different server and each with different data,
you also get a problem: a merge conflict. In certain situations you can
solve it; databases often have it simpler here, as they can just say the
newer entry "wins" and the older one is out of date and would have been
overwritten anyway, so I guess RAC can make use of this. But what do you do
if two VMs write to the same block on shared storage? That block can
represent a different thing for each VM, so a decision without manual
intervention is in general impossible here.

I mean, our cluster filesystem can also work like this and has never had a
(known) split brain, even in two-node clusters where one node failed and the
other was set to have quorum. We have a (relatively) small task to solve and
thus fewer possible errors, as there is less to think about. So it's not
that we are in general unable to do such things; there are simply different
limitations when doing different things. :)

Proxmox VE serves virtual guest systems and effectively knows nothing about
them, so it has a harder time ensuring that when it recovers, it really
recovers and does not cause more corruption than it repairs. There is also
shared access to resources - storage, as already mentioned above, or IP
address collisions, ... So as "third level" disaster recovery (the first
being the application level, the second the hardware level) we need stricter
rules to follow: we need fencing, and we need to ensure that we are not a
failed node ourselves, thus we need quorum. And quorum between two nodes
will get you a tie in the case of a failure.

In a lot of cases you could buy three slightly smaller machines instead of
two heavy ones: more redundancy, better load balancing, real HA possible.
But yes, it may not be suitable in every situation - I understand that.
Also, you need three machines nonetheless (2 PVE + shared storage), so a
possibility would also be to remove the shared storage node (which is
probably a single point of failure one way or the other, and surely not
cheap) and use three nodes with a decentralized storage technology: ceph,
gluster, sheepdog, ...

So nothing against two-node clusters - those are really great for a lot of
people - but if someone wants real HA then those are not enough. Simply
having three nodes is not enough either; redundancy then has to happen at
all levels: power supplies, network, shared storage ...

> Nevertheless, it's very good to see that a simple third-vote solution is
> on the horizon, which could be easily integrated on a RPi or an even less
> "powerhungry" machine.

I would not mark the RPi as "powerhungry" :D But yes, it's a cool idea in
general.

cheers,
Thomas

> Best,
> Andreas
>
> On Wed, Sep 28, 2016 at 3:46 PM, Thomas Lamprecht wrote:
>> Hi,
>>
>> QDisks are not ideal and will probably not be supported by Proxmox VE
>> themselves. Also, I would really love to see the term "two node HA"
>> vanish, as it is only marketing talk and technically simply not possible
>> (sadly, basic rules of our universe make it impossible); they call a
>> setup with three voters (the two nodes + the storage node) "two node HA"
>> to sound better...
>>
>> That said, rant aside, there are plans to add the corosync (our cluster
>> communication stack) QDevice daemon, which then allows qdevices (at the
>> moment there is only QNetd) to provide votes for one or more clusters.
>> This QNetd device may run on a non-Proxmox VE node and uses TCP/IP to
>> communicate with the cluster. So you can have a two-node cluster, set up
>> the qdevice daemon there and the qnetd daemon on your storage box which
>> then provides the
Re: [pve-devel] pve-manager and disk IO monitoring
29.09.2016 09:05, Michael Rasmussen wrote:
> On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
> Alexandre DERUMIER wrote:
>> iostats are coming from qemu.
>>
>> what is the output of monitor "info blockstats" for the vm where you don't
>> have stats ?
>>
> Two examples below:
>
> [# info blockstats output for both VMs snipped - all counters are 0]
>
It's a side effect of SCSI pass-through, which is used by default for
[libi]scsi volumes with the scsi VM disk interface. QEMU is just not aware
of the VM's block IO in that case. Also, cache settings for those volumes
are ineffective, because qemu just proxies raw scsi commands to the backing
storage, so caching is impossible.

Do you use PVE backups (vzdump)? Do they work for the machines without
stats? I think they should also not work with pass-through.
Re: [pve-devel] pve-manager and disk IO monitoring
29.09.2016 09:21, Michael Rasmussen wrote:
> On Thu, 29 Sep 2016 09:17:56 +0300
> Dmitry Petuhov wrote:
>> It's a side effect of SCSI pass-through, which is used by default for
>> [libi]scsi volumes with the scsi VM disk interface. QEMU is just not aware
>> of the VM's block IO in that case. Also, cache settings for those volumes
>> are ineffective, because qemu just proxies raw scsi commands to the
>> backing storage, so caching is impossible.
>>
>> Do you use PVE backups (vzdump)? Do they work for the machines without
>> stats? I think they should also not work with pass-through.
>>
> What do you mean by pass-through? (No pass-through is happening here,
> since the storage resides on a SAN.)
>
In QemuServer.pm (some code omitted):

    if ($drive->{interface} eq 'scsi')
        my $devicetype = 'hd';
        if ($path =~ m/^iscsi\:\/\//) {
            $devicetype = 'generic';
        }
        $device = "scsi-$devicetype ...

So usually, if the drive interface is scsi, PVE uses the fully-emulated qemu
device 'scsi-hd'. But for iscsi: volumes (iscsi direct and zfs over iscsi)
it uses the 'scsi-generic' device, which just proxies scsi commands between
the guest OS and your SAN's iscsi target.

BTW, I began writing code to turn pass-through on or off in the storage
config, so that we could force it off even where it could be used. If the
developers are interested, I can find it.
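To illustrate what such a per-storage switch could look like, here is a minimal, hypothetical sketch. The option name (nopassthrough) and its plumbing are invented for illustration; only the surrounding QemuServer.pm logic is taken from the snippet quoted above.

    # hypothetical: let the storage configuration veto SCSI pass-through
    my $devicetype = 'hd';
    if ($path =~ m/^iscsi:\/\//) {
        # $scfg would be the storage config hash for this volume's storage;
        # 'nopassthrough' is an invented option name, not an existing PVE one
        if ($scfg && $scfg->{nopassthrough}) {
            $devicetype = 'hd';        # stay with full emulation (blockstats work)
        } else {
            $devicetype = 'generic';   # raw SCSI pass-through (current default)
        }
    }
    $device = "scsi-$devicetype,...";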
Re: [pve-devel] [PATCH v3 storage 0/4] improve SMART handling
applied
[pve-devel] [PATCH] ha-manager: add examples to group settings
Signed-off-by: Thomas Lamprecht
---
This should help people to understand the settings better.

 ha-manager.adoc | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 4a9e81a..c7c65f4 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -395,17 +395,39 @@ A service bound to this group will run on the nodes with the highest priority
 available. If more nodes are in the highest priority class the services
 will get distributed to those node if not already there. The priorities
 have a relative meaning only.
++
+Example;;
+You want to run all services from a group on node1 if possible. If this node
+is not available you want them to run equally split on node2 and node3, and
+if those fail too it should use the other group members.
+To achieve this you could set the node list to:
+[source,bash]
+ ha-manager groupset mygroup -nodes "node1:2,node2:1,node3:1,node4"
 
 restricted::
 
 Resources bound to this group may only run on nodes defined by the
 group. If no group node member is available the resource will be
 placed in the stopped state.
++
+Example;;
+A service can run on just a few nodes, as it uses resources only found on
+those. We created a group with said nodes, and as we know that all other
+nodes get implicitly added with the lowest priority, we set the restricted
+option.
 
 nofailback::
 
 The resource won't automatically fail back when a more preferred node
 (re)joins the cluster.
++
+Examples;;
+* You need to migrate a service to a node which hasn't the highest priority
+in the group at the moment. To tell the HA manager not to move this service
+instantly back, set the nofailback option and the service will stay on the
+current node.
+
+* A service was fenced and got recovered to another node. The admin repaired
+the node and brought it back online, but does not want the recovered
+services to move straight back to the repaired node, as he wants to first
+investigate the failure cause and check that the node runs stable. He can
+use the nofailback option to achieve this.
 
 Start Failure Policy
-- 
2.1.4
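As a usage side note, the same settings shown in the documentation examples above can also be given when the group is first created. A hedged sketch using the ha-manager CLI; the group and node names are examples and the exact option syntax should be verified against the ha-manager man page:

  # create a restricted group that prefers node1 and never fails back automatically
  ha-manager groupadd mygroup -nodes "node1:2,node2:1,node3:1" -restricted 1 -nofailback 1

  # later adjustments use groupset, as in the example in the patch
  ha-manager groupset mygroup -nodes "node1:2,node2:1,node3:1,node4"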
Re: [pve-devel] Two-Node HA
Hello,

another option for 2-node HA would be what HA-Lizard for XenServer does. Basically, they test whether some external IPs (e.g. storage/switches) can be reached to ensure quorum/majority in the two-node setup. This is maybe not the best solution, but way better than running into a split brain.

Alex

On 29.09.16 at 08:09, Thomas Lamprecht wrote:
> Hi Andreas,
>
> [full quote of Thomas' reply snipped - see his message above]
[pve-devel] backup suspend mode with guest agent enable : fsfreeze timeout
Hi,

if we try to run a backup in suspend mode with the guest agent enabled, it seems that the fsfreeze QMP command is sent after the suspend, so the guest agent is no longer responding:

INFO: suspend vm
INFO: snapshots found (not included into backup)
INFO: creating archive '/var/lib/vz/dump/vzdump-qemu-104-2016_09_29-12_06_13.vma.lzo'
ERROR: VM 104 qmp command 'guest-fsfreeze-freeze' failed - got timeout
ERROR: VM 104 qmp command 'guest-fsfreeze-thaw' failed - got timeout
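The report points at an ordering problem: the guest agent can only answer guest-fsfreeze-freeze while the guest's VCPUs are still running. A minimal sketch of the presumably intended ordering, in pseudocode-style Perl with hypothetical helper names (not the actual vzdump/QemuServer functions):

    # hypothetical helpers, for illustration of the ordering only
    agent_cmd($vmid, 'guest-fsfreeze-freeze');   # while the guest can still respond
    vm_suspend($vmid);                           # now pause the VCPUs
    create_backup_archive($vmid);                # write the vma archive
    vm_resume($vmid);                            # guest runs again
    agent_cmd($vmid, 'guest-fsfreeze-thaw');     # thaw only after resume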
Re: [pve-devel] [PATCH v2 common] Network: add disable_ipv6 and use it
applied
[pve-devel] [PATCH v2 common] Network: add disable_ipv6 and use it
Many interfaces used to get an ipv6 link-local address which was usually
unusable and therefore pointless. In order to ensure consistency this is
called in various places:

* $bridge_add_interface() and $ovs_bridge_add_port() because it's generally
  a good choice for bridge ports.
* tap_create() and veth_create() because they activate the interfaces and we
  want to avoid the link-local address existing temporarily between bringing
  the interface up and adding it to a bridge.
* create_firewall_bridge_*() because firewall bridges aren't meant to have
  addresses either.
* activate_bridge_vlan() - if vlan_filtering is disabled we create
  vlan-bridges and neither them nor their physical ports should have
  link-local addresses.
---
Changes since v1, just cleanups:
* use existing $ifacevlan variable instead of rebuilding it
* replaced an `ip link set * up` by $activate_interface()

 src/PVE/Network.pm | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/src/PVE/Network.pm b/src/PVE/Network.pm
index b760c42..3a0d778 100644
--- a/src/PVE/Network.pm
+++ b/src/PVE/Network.pm
@@ -171,9 +171,20 @@ my $cond_create_bridge = sub {
     }
 };
 
+sub disable_ipv6 {
+    my ($iface) = @_;
+    return if !-d '/proc/sys/net/ipv6'; # ipv6 might be completely disabled
+    my $file = "/proc/sys/net/ipv6/conf/$iface/disable_ipv6";
+    open(my $fh, '>', $file) or die "failed to open $file for writing: $!\n";
+    print {$fh} "1\n" or die "failed to disable link-local ipv6 for $iface\n";
+    close($fh);
+}
+
 my $bridge_add_interface = sub {
     my ($bridge, $iface, $tag, $trunks) = @_;
 
+    # drop link local address (it can't be used when on a bridge anyway)
+    disable_ipv6($iface);
     system("/sbin/brctl addif $bridge $iface") == 0 ||
 	die "can't add interface 'iface' to bridge '$bridge'\n";
 
@@ -215,6 +226,7 @@ my $ovs_bridge_add_port = sub {
     $cmd .= " -- set Interface $iface type=internal" if $internal;
     system($cmd) == 0 ||
 	die "can't add ovs port '$iface'\n";
+    disable_ipv6($iface);
 };
 
 my $activate_interface = sub {
@@ -232,6 +244,7 @@ sub tap_create {
     my $bridgemtu = &$read_bridge_mtu($bridge);
 
     eval {
+	disable_ipv6($iface);
 	PVE::Tools::run_command("/sbin/ifconfig $iface 0.0.0.0 promisc up mtu $bridgemtu");
     };
     die "interface activation failed\n" if $@;
@@ -252,6 +265,8 @@ sub veth_create {
     }
 
     # up vethpair
+    disable_ipv6($veth);
+    disable_ipv6($vethpeer);
     &$activate_interface($veth);
     &$activate_interface($vethpeer);
 }
@@ -272,6 +287,7 @@ my $create_firewall_bridge_linux = sub {
     my ($fwbr, $vethfw, $vethfwpeer) = &$compute_fwbr_names($vmid, $devid);
 
     &$cond_create_bridge($fwbr);
+    disable_ipv6($fwbr);
     &$activate_interface($fwbr);
 
     copy_bridge_config($bridge, $fwbr);
@@ -292,6 +308,7 @@ my $create_firewall_bridge_ovs = sub {
     my $bridgemtu = &$read_bridge_mtu($bridge);
 
     &$cond_create_bridge($fwbr);
+    disable_ipv6($fwbr);
     &$activate_interface($fwbr);
 
     &$bridge_add_interface($fwbr, $iface);
@@ -410,10 +427,13 @@ sub activate_bridge_vlan_slave {
 
     # create vlan on $iface is not already exist
     if (! -d "/sys/class/net/$ifacevlan") {
-	system("/sbin/ip link add link $iface name ${iface}.${tag} type vlan id $tag") == 0 ||
+	system("/sbin/ip link add link $iface name $ifacevlan type vlan id $tag") == 0 ||
 	    die "can't add vlan tag $tag to interface $iface\n";
     }
 
+    # remove ipv6 link-local address before activation
+    disable_ipv6($ifacevlan);
+
     # be sure to have the $ifacevlan up
     &$activate_interface($ifacevlan);
 
@@ -468,9 +488,10 @@ sub activate_bridge_vlan {
 
 	#fixme: set other bridge flags
 
+	# remove ipv6 link-local address before activation
+	disable_ipv6($bridgevlan);
 	# be sure to have the bridge up
-	system("/sbin/ip link set $bridgevlan up") == 0 ||
-	    die "can't up bridge $bridgevlan\n";
+	&$activate_interface($bridgevlan);
     });
     return $bridgevlan;
 }
-- 
2.1.4
Re: [pve-devel] pve-manager and disk IO monitoring
On Fri, 30 Sep 2016 00:51:06 +0200
Michael Rasmussen wrote:
>
> So my question is: Why use scsi-generic instead of scsi-block when
> scsi-generic prevents blockstats?
>
Running a fio test also only shows marginal performance difference between
scsi-block and scsi-generic

-device scsi-block

iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
fio-2.1.11
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 3072MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [73928KB/19507KB/0KB /s] [17.3K/4381/0 iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=1568: Fri Sep 30 01:17:05 2016
  Description : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=2454.9MB, bw=87501KB/s, iops=14328, runt= 28728msec
    slat (usec): min=2, max=4703, avg=10.47, stdev=16.99
    clat (usec): min=315, max=1505.6K, avg=3479.55, stdev=8270.22
     lat (usec): min=321, max=1505.6K, avg=3490.40, stdev=8270.14
    clat percentiles (usec):
     |  1.00th=[ 1768],  5.00th=[ 2480], 10.00th=[ 2640], 20.00th=[ 2864],
     | 30.00th=[ 2960], 40.00th=[ 3056], 50.00th=[ 3088], 60.00th=[ 3152],
     | 70.00th=[ 3248], 80.00th=[ 3376], 90.00th=[ 3824], 95.00th=[ 4448],
     | 99.00th=[ 8768], 99.50th=[13120], 99.90th=[52992], 99.95th=[103936],
     | 99.99th=[536576]
    bw (KB /s): min= 7148, max=193016, per=100.00%, avg=87866.39, stdev=28395.12
  write: io=631998KB, bw=21999KB/s, iops=3590, runt= 28728msec
    slat (usec): min=4, max=9301, avg=12.69, stdev=33.41
    clat (usec): min=299, max=778312, avg=3871.08, stdev=7378.66
     lat (usec): min=305, max=778320, avg=3884.17, stdev=7378.66
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    7],
     | 99.00th=[   13], 99.50th=[   19], 99.90th=[   55], 99.95th=[  101],
     | 99.99th=[  537]
    bw (KB /s): min= 1524, max=46713, per=100.00%, avg=22089.18, stdev=7184.64
    lat (usec) : 500=0.01%, 750=0.03%, 1000=0.06%
    lat (msec) : 2=1.29%, 4=88.78%, 10=8.94%, 20=0.56%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%
  cpu : usr=8.24%, sys=28.49%, ctx=451227, majf=0, minf=8
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=411627/w=103162/d=0, short=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=2454.9MB, aggrb=87501KB/s, minb=87501KB/s, maxb=87501KB/s, mint=28728msec, maxt=28728msec
  WRITE: io=631997KB, aggrb=21999KB/s, minb=21999KB/s, maxb=21999KB/s, mint=28728msec, maxt=28728msec

Disk stats (read/write):
  sda: ios=407383/102110, merge=123/54, ticks=1413272/456272, in_queue=1869620, util=99.71%

-device scsi-generic

iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
fio-2.1.11
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 3072MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [64339KB/16908KB/0KB /s] [15.2K/3816/0 iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=701: Fri Sep 30 01:20:45 2016
  Description : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=2454.9MB, bw=88384KB/s, iops=14473, runt= 28441msec
    slat (usec): min=5, max=5814, avg=10.86, stdev=21.71
    clat (usec): min=459, max=885935, avg=3451.71, stdev=3297.21
     lat (usec): min=526, max=885944, avg=3462.97, stdev=3297.14
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    4], 95.00th=[    5],
     | 99.00th=[    8], 99.50th=[   11], 99.90th=[   23], 99.95th=[   63],
     | 99.99th=[  153]
    bw (KB /s): min=46295, max=139025, per=100.00%, avg=88833.25, stdev=22609.61
  write: io=631998KB, bw=1KB/s, iops=3627, runt= 28441msec
    slat (usec): min=6, max=3864, avg=12.96, stdev=24.18
    clat (usec): min=582, max=156777, avg=3801.87, stdev=3128.06
     lat (usec): min=610, max=156789, avg=3815.24, stdev=3128.36
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    7],
     | 99.00th=[   11], 99.50th=[   15], 99.90th=[   49], 99.95th=[   74],
     | 99.99th=[  153]
    bw (KB /s): min=11151, max=36378, per=100.00%, avg=22332.46, stdev=5869.71
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.67%, 4=90.61%, 10=8.03%,
Re: [pve-devel] pve-manager and disk IO monitoring
On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
Alexandre DERUMIER wrote:
> iostats are coming from qemu.
>
> what is the output of monitor "info blockstats" for the vm where you don't
> have stats ?
>
I have just tested replacing -device scsi-generic with -device scsi-block. The machine boots and seems to work and, lo and behold, I have disk IO stats again!

# info blockstats
drive-ide2: rd_bytes=152 wr_bytes=0 rd_operations=4 wr_operations=0 flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=95326 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=258695709709
drive-scsi0: rd_bytes=266729984 wr_bytes=1690120192 rd_operations=15168 wr_operations=6182 flush_operations=0 wr_total_time_ns=512105318651 rd_total_time_ns=61640872040 flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=6506064324

So my question is: Why use scsi-generic instead of scsi-block when scsi-generic prevents blockstats?

--
Hilsen/Regards
Michael Rasmussen
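For context, the local test described above maps to a small change in the device selection quoted earlier from QemuServer.pm. A hedged sketch of what that swap would look like - not an official patch, and it leaves open whether scsi-block is appropriate for every iscsi:// backend:

    if ($path =~ m/^iscsi:\/\//) {
        # 'block' instead of 'generic': reads and writes go through QEMU's
        # block layer, so blockstats (and cache settings) work again, while
        # SCSI pass-through of other commands remains possible for the LUN
        $devicetype = 'block';
    }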