Re: [pve-devel] Two-Node HA

2016-09-29 Thread Dietmar Maurer
FYI, there was support for using a quorum disk in 3.x:

https://pve.proxmox.com/wiki/Two-Node_High_Availability_Cluster

But the corosync developers decided to drop support for that. AFAIK this
technology is a leftover from the old days, and nobody was happy with
that complex and error-prone software. Therefore, they decided to
implement a better way to provide quorum (qdevice, qnetd) ...
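
For reference, on the corosync side such a qdevice is configured in corosync.conf
roughly like this (a minimal sketch; the qnetd host address is a placeholder):

quorum {
  provider: corosync_votequorum
  device {
    model: net
    votes: 1
    net {
      host: 192.0.2.10   # machine running corosync-qnetd
      algorithm: ffsplit
    }
  }
}

The corosync-qdevice daemon on each cluster node then talks to corosync-qnetd on
the external host, which decides which partition gets the extra vote.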



Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Dietmar Maurer
> So my question is: Why use scsi-generic instead of scsi-block when
> scsi-generic prevents blockstats?

commit d454d040338a6216c8d3e5cc9623d6223476cb5a
Author: Alexandre Derumier 
Date:   Tue Aug 28 12:46:07 2012 +0200

use scsi-generic by default with libiscsi

This add scsi passthrough with libiscsi

Signed-off-by: Alexandre Derumier 


@Alexandre: This was for performance reasons?



Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Alexandre DERUMIER
"Running a fio test also only shows marginal performance difference
between scsi-block and scsi-generic"

I think an 11% difference is not so marginal.
I'm curious to see the difference with a full-flash array, and whether we hit the
same cpu/iothread bottleneck as with ceph when comparing scsi-block vs scsi-generic.

Maybe we can add an option to choose between scsi-block && scsi-generic.





Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Michael Rasmussen
On Thu, 29 Sep 2016 09:17:56 +0300
Dmitry Petuhov  wrote:

> It's a side effect of scsi pass-through, which is used by default for
> [libi]scsi volumes with the scsi VM disk interface. QEMU is just not aware of VM
> block IO in that case. Also, cache settings for volumes are ineffective,
> because qemu is just proxying raw scsi commands to the backing storage, so
> caching is impossible.
> 
> Do you use PVE backups (vzdump)? Does it work for machines without stats? I
> think it also shouldn't work with pass-through.
> 
What do you mean by pass-through? (no pass-through is happening here
since the storage resides on a SAN)

And yes, vzdump works for these machines.

-- 
Hilsen/Regards
Michael Rasmussen





Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Michael Rasmussen
On Thu, 29 Sep 2016 09:41:35 +0300
Dmitry Petuhov  wrote:

> In QemuServer.pm (some code omitted):
> 
> if ($drive->{interface} eq 'scsi') {
>     my $devicetype = 'hd';
>     if ($path =~ m/^iscsi\:\/\//) {
>         $devicetype = 'generic';
>     }
>     $device = "scsi-$devicetype ...
> 
> So usually, if the drive interface is scsi, PVE uses the fully-emulated qemu
> device 'scsi-hd'. But for iscsi: volumes (iSCSI direct and ZFS over iSCSI) it
> uses the 'scsi-generic' device, which just proxies scsi commands between the
> guest OS and your SAN's iscsi target.
> 
I see. So currently, by using scsi-generic, you sort of disable all
qemu block features like monitoring etc.?

-- 
Hilsen/Regards
Michael Rasmussen





Re: [pve-devel] student question about ha "restricted" option

2016-09-29 Thread Dietmar Maurer
> Maybe could we improve the documentation, and add some examples ?

We have been working on that for several months now. People can also send patches ;-)
 
BTW, do you know the HA simulator?



Re: [pve-devel] [PATCH] Improvement for pve-installation

2016-09-29 Thread Dietmar Maurer
This patch makes a number of changes, but it is not clear why exactly.

There are some simple fixes, but also some bigger rewrites.

Please can you split that into smaller patches, and add a reasonable commit
message to each change?



Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Michael Rasmussen
On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
Alexandre DERUMIER  wrote:

> iostats are coming from qemu.
> 
> what is the output of monitor "info blockstats" for the vm where you don't 
> have stats ?
> 
> 
Two examples below:
# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi1: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0


-- 
Hilsen/Regards
Michael Rasmussen





Re: [pve-devel] Two-Node HA

2016-09-29 Thread Thomas Lamprecht

Hi Andreas,


On 09/28/2016 05:53 PM, Andreas Steinel wrote:

Hi Thomas,

Thank you for your time and your answer.

I wonder why e.g. an Oracle Real Application Cluster (RAC) works so
well with two nodes in an HA setup. We deployed 50+ clusters in the last
years and never had a split-brain-like situation. Rolling updates, as
well as occasional host crashes, are also possible without losing data,
sometimes without even losing sessions. If you use Transparent Application
Failover (TAF), your database sessions will be migrated to the other node,
rolled back and restarted (of course, application support is required on the
"client" side). It's not perfect, but it works most of the time. We had only
a few total crashes, mainly due to storage issues, but also due to some bugs
in the cluster stack.


A bit of a lengthy explanation below on why this comparison may not work,
IMHO.

Oracle RAC and Proxmox VE do different things: one is an application with
quasi fail-silent characteristics running at the application level, the
other is an operating system running on bare metal, where Byzantine errors
are possible.

With RAC you serve clients: if a client cannot reach you, it asks another
server, and if you're dead you sync up when starting again. You are a closed
system which knows what runs inside and how the other server reacts if
something happens. But if the communication between the cluster nodes is
broken while clients can still reach both, and two clients write to the same
dataset, each on a different server and each with different data, you will
also get problems, namely a merge conflict. In certain situations you can
solve that; databases often have it easier here, as they can just say the
newer entry "wins" and the older one is out of date and would have been
overwritten anyway, so I guess RAC can utilize this.
But what do you do if two VMs write to the same block on a shared storage?
That block can represent a different thing for each VM, so a decision without
manual intervention is in general impossible here.

I mean, our cluster filesystem can also work like this and has never had a
(known) split brain, even in two-node clusters where one node failed and the
other was set to have quorum. But we have a (relatively) small task to solve
and thus more options and fewer possible errors, simply because there is less
to think about. So it's not that we are in general unable to do such things,
but there are different limitations when doing different things. :)

Proxmox VE, on the other hand, serves virtual guest systems and effectively
knows nothing about them, so it has a harder time ensuring that when it
recovers a service it really recovers it and does not cause more corruption
than it repairs. There is also shared access to resources: storage, as
already mentioned above, IP address collisions, ...
So as "third level" disaster recovery (the first being the application level,
the second the hardware level) we need stricter rules: we need fencing, and
we need to ensure that we are not the failed node ourselves, thus we need
quorum. And quorum between two nodes will get you a tie in the case of a
failure.
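(To make that concrete: with the default of one vote per node, expected votes = 2
and quorum = floor(2/2) + 1 = 2, so the moment one node or just the link between
them fails, the survivor is left with 1 vote and cannot tell a dead peer from a
cut cable.)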

In a lot of cases you could buy three slightly smaller machines instead of
two heavy ones: more redundancy, better load balancing, real HA possible.
But yes, it may not be suitable in every situation - I understand that.
Also, you need three machines anyway (2 PVE nodes + shared storage), so a
possibility would be to drop the dedicated shared-storage node (which is
probably a single point of failure one way or the other, and surely not cheap)
and use three nodes with a decentralized storage technology: ceph, gluster,
sheepdog, ...

So nothing against two-node clusters, those are really great for a lot of
people. But if someone wants real HA then two nodes are not enough, and simply
having three nodes is not enough either: redundancy has to happen at all levels
then, power supplies, network, shared storage ...



Nevertheless, it's very good to see that a simple third-vote solution
is on the horizon, which could easily be integrated on an RPi or an
even less "power-hungry" machine.


I would not call the RPi "power-hungry" :D But yes, it's a cool idea in
general.

cheers,
Thomas


Best,
Andreas

On Wed, Sep 28, 2016 at 3:46 PM, Thomas Lamprecht
 wrote:

Hi,

QDisks are not ideal and will probably not be supported by Proxmox VE
itself. Also, I would really love to see the term "two node HA" vanish, as it's
only marketing talk and is technically simply not possible (sadly, basic rules
of our universe make it impossible); they call a setup with three voters (the
two nodes + the storage node) "two node HA" to sound better...

That said, rant aside, there are plans to add support for the corosync (our
cluster communication stack) QDevice daemon, which allows qdevices (at the
moment there is only QNetd) to provide votes for one or more clusters.

This QNetd daemon may run on a non-Proxmox VE node and uses TCP/IP to
communicate with the cluster.

So you can have a two-node cluster, set up the qdevice daemon there and the
qnetd daemon on your storage box, which then provides the

Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Dmitry Petuhov

29.09.2016 09:05, Michael Rasmussen wrote:

On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
Alexandre DERUMIER  wrote:


iostats are coming from qemu.

what is the output of monitor "info blockstats" for the vm where you don't have 
stats ?



Two examples below:
# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi1: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
# info blockstats
drive-ide2: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
drive-scsi0: rd_bytes=0 wr_bytes=0 rd_operations=0 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=0 flush_total_time_ns=0 
rd_merged=0 wr_merged=0 idle_time_ns=0
It's a side effect of scsi pass-through, which is used by default
for [libi]scsi volumes with the scsi VM disk interface. QEMU is just not
aware of VM block IO in that case. Also, cache settings for volumes are
ineffective, because qemu is just proxying raw scsi commands to the backing
storage, so caching is impossible.


Do you use PVE backups (vzdump)? Does it work for machines without stats?
I think it also shouldn't work with pass-through.




Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Dmitry Petuhov

29.09.2016 09:21, Michael Rasmussen wrote:

On Thu, 29 Sep 2016 09:17:56 +0300
Dmitry Petuhov  wrote:


It's a side effect of scsi pass-through, which is used by default for
[libi]scsi volumes with the scsi VM disk interface. QEMU is just not aware of VM
block IO in that case. Also, cache settings for volumes are ineffective,
because qemu is just proxying raw scsi commands to the backing storage, so
caching is impossible.

Do you use PVE backups (vzdump)? Does it work for machines without stats? I
think it also shouldn't work with pass-through.


What do you mean by pass-through? (no pass-through is happening here
since the storage resides on a SAN)

In QemuServer.pm (some code omitted):

if ($drive->{interface} eq 'scsi') {
    my $devicetype = 'hd';
    if ($path =~ m/^iscsi\:\/\//) {
        $devicetype = 'generic';
    }
    $device = "scsi-$devicetype ...

So usually, if the drive interface is scsi, PVE uses the fully-emulated qemu
device 'scsi-hd'. But for iscsi: volumes (iSCSI direct and ZFS over iSCSI) it
uses the 'scsi-generic' device, which just proxies scsi commands between the
guest OS and your SAN's iscsi target.


BTW, I began writing code to switch pass-through on|off in the storage
config, so that we could force it off even when it could be used. If the
developers are interested, I can find it.
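
For illustration, such a switch could look roughly like this. This is only a
sketch: 'nopassthrough' is a made-up property name, and the $storecfg/$storeid
variables are assumed to be available at that point in the code.

# sketch only, not the actual patch
my $devicetype = 'hd';
if ($path =~ m/^iscsi\:\/\//) {
    my $scfg = PVE::Storage::storage_config($storecfg, $storeid);
    # pass-through stays the default, but the storage config can force it off
    $devicetype = 'generic' if !$scfg->{nopassthrough};
}
$device = "scsi-$devicetype ...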




Re: [pve-devel] [PATCH v3 storage 0/4] improve SMART handling

2016-09-29 Thread Dietmar Maurer
applied



[pve-devel] [PATCH] ha-manager: add examples to group settings

2016-09-29 Thread Thomas Lamprecht
Signed-off-by: Thomas Lamprecht 
---

This should help people to understand the settings better.

 ha-manager.adoc | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/ha-manager.adoc b/ha-manager.adoc
index 4a9e81a..c7c65f4 100644
--- a/ha-manager.adoc
+++ b/ha-manager.adoc
@@ -395,17 +395,39 @@ A service bound to this group will run on the nodes with 
the highest priority
 available. If more nodes are in the highest priority class the services will
 get distributed to those node if not already there. The priorities have a
 relative meaning only.
+  Example;;
+  You want to run all services from a group on node1 if possible. If this node
+  is not available, you want them split equally between node2 and node3, and
+  if those fail as well, the other group members should be used.
+  To achieve this you could set the node list to:
+[source,bash]
+  ha-manager groupset mygroup -nodes "node1:2,node2:1,node3:1,node4"
 
 restricted::
 
 Resources bound to this group may only run on nodes defined by the
 group. If no group node member is available the resource will be
 placed in the stopped state.
+  Example;;
+  A service can run on just a few nodes, as it uses resources only available
+  on those. We create a group with said nodes and, since all other nodes would
+  otherwise get added implicitly with the lowest priority, we also set the
+  restricted option.
 
 nofailback::
 
 The resource won't automatically fail back when a more preferred node
 (re)joins the cluster.
+  Examples;;
+  * You need to migrate a service to a node which currently does not have the
+  highest priority in the group. To tell the HA manager not to move the service
+  back instantly, set the nofailback option and the service will stay put.
+
+  * A service was fenced and got recovered to another node. The admin repaired
+  the failed node and brought it back online, but does not want the recovered
+  services to move straight back, as he first wants to investigate the failure
+  cause and check that the node runs stably. He can use the nofailback option
+  to achieve this.
 
 
 Start Failure Policy
-- 
2.1.4
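
As an aside, the restricted example could get a matching command line, analogous
to the nodes example above (assuming the restricted flag is exposed through
groupset like the other group options; the group name is just an example):

  ha-manager groupset mygroup2 -nodes "node1,node2" -restricted 1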




Re: [pve-devel] Two-Node HA

2016-09-29 Thread Alexander Schmid

Hello,
another option for two-node HA would be what HA-Lizard for XenServer does:
basically it tests whether some external IPs (e.g. storage/switches) can be
reached to ensure quorum/majority in the two-node setup. This is maybe
not the best solution, but way better than running into a split-brain.
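
As a minimal sketch of the idea (the witness addresses below are placeholders):
such a tiebreaker boils down to counting how many external IPs a node can still
reach before it decides to keep running HA services.

#!/usr/bin/perl
use strict;
use warnings;

# placeholder witness addresses (e.g. the storage box and the switches)
my @witnesses = ('192.0.2.10', '192.0.2.11', '192.0.2.12');

my $reachable = 0;
for my $ip (@witnesses) {
    # one ping with a one second timeout
    $reachable++ if system('ping', '-c', '1', '-W', '1', '-q', $ip) == 0;
}

if ($reachable * 2 > scalar @witnesses) {
    print "$reachable witnesses reachable, assume we are on the healthy side\n";
} else {
    print "only $reachable witnesses reachable, assume we are the isolated node\n";
}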


Alex

[pve-devel] backup suspend mode with guest agent enable : fsfreeze timeout

2016-09-29 Thread Alexandre DERUMIER
Hi,

if we try to run a backup in suspend mode with the guest agent enabled,
it seems that the fsfreeze qmp command is sent after the suspend, so the guest
agent cannot respond.


INFO: suspend vm
INFO: snapshots found (not included into backup)
INFO: creating archive 
'/var/lib/vz/dump/vzdump-qemu-104-2016_09_29-12_06_13.vma.lzo'
ERROR: VM 104 qmp command 'guest-fsfreeze-freeze' failed - got timeout
ERROR: VM 104 qmp command 'guest-fsfreeze-thaw' failed - got timeout
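
It looks like the freeze has to happen while the guest, and therefore the agent,
is still running. A rough sketch of the intended ordering; qga_cmd() and
monitor_cmd() are made-up helper names here, not the actual vzdump/QemuServer
functions:

sub suspend_backup_prepare {
    my ($vmid) = @_;

    # 1) freeze guest filesystems while the guest agent can still answer
    eval { qga_cmd($vmid, 'guest-fsfreeze-freeze') };
    warn "guest-fsfreeze-freeze failed: $@" if $@;

    # 2) only then pause the VM
    monitor_cmd($vmid, 'stop');
}

sub suspend_backup_cleanup {
    my ($vmid) = @_;

    # resume the VM first, otherwise the agent cannot answer the thaw either
    monitor_cmd($vmid, 'cont');
    eval { qga_cmd($vmid, 'guest-fsfreeze-thaw') };
    warn "guest-fsfreeze-thaw failed: $@" if $@;
}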




Re: [pve-devel] [PATCH v2 common] Network: add disable_ipv6 and use it

2016-09-29 Thread Dietmar Maurer
applied



[pve-devel] [PATCH v2 common] Network: add disable_ipv6 and use it

2016-09-29 Thread Wolfgang Bumiller
Many interfaces used to get an ipv6 link-local address which
was usually unusable and therefore pointless.

In order to ensure consistency this is called in various
places:
* $bridge_add_interface() and $ovs_bridge_add_port() because
  it's generally a good choice for bridge ports.
* tap_create() and veth_create() because they activate the
  interfaces and we want to avoid the link-local address
  existing temporarily between bringing the interface up and
  adding it to a bridge.
* create_firewall_bridge_*() because firewall bridges aren't
  meant to have addresses either.
* activate_bridge_vlan() - if vlan_filtering is disabled we
  create vlan bridges and neither they nor their physical
  ports should have link-local addresses.
---
Changes since v1 just cleanups:
 * use existing $ifacevlan variable instead of rebuilding it
 * replaced an `ip link set * up` by $activate_interface()

 src/PVE/Network.pm | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/src/PVE/Network.pm b/src/PVE/Network.pm
index b760c42..3a0d778 100644
--- a/src/PVE/Network.pm
+++ b/src/PVE/Network.pm
@@ -171,9 +171,20 @@ my $cond_create_bridge = sub {
 }
 };
 
+sub disable_ipv6 {
+my ($iface) = @_;
+return if !-d '/proc/sys/net/ipv6'; # ipv6 might be completely disabled
+my $file = "/proc/sys/net/ipv6/conf/$iface/disable_ipv6";
+open(my $fh, '>', $file) or die "failed to open $file for writing: $!\n";
+print {$fh} "1\n" or die "failed to disable link-local ipv6 for $iface\n";
+close($fh);
+}
+
 my $bridge_add_interface = sub {
 my ($bridge, $iface, $tag, $trunks) = @_;
 
+# drop link local address (it can't be used when on a bridge anyway)
+disable_ipv6($iface);
 system("/sbin/brctl addif $bridge $iface") == 0 ||
die "can't add interface 'iface' to bridge '$bridge'\n";
 
@@ -215,6 +226,7 @@ my $ovs_bridge_add_port = sub {
 $cmd .= " -- set Interface $iface type=internal" if $internal;
 system($cmd) == 0 ||
die "can't add ovs port '$iface'\n";
+disable_ipv6($iface);
 };
 
 my $activate_interface = sub {
@@ -232,6 +244,7 @@ sub tap_create {
 my $bridgemtu = &$read_bridge_mtu($bridge);
 
 eval { 
+   disable_ipv6($iface);
PVE::Tools::run_command("/sbin/ifconfig $iface 0.0.0.0 promisc up mtu $bridgemtu");
 };
 die "interface activation failed\n" if $@;
@@ -252,6 +265,8 @@ sub veth_create {
 }
 
 # up vethpair
+disable_ipv6($veth);
+disable_ipv6($vethpeer);
 &$activate_interface($veth);
 &$activate_interface($vethpeer);
 }
@@ -272,6 +287,7 @@ my $create_firewall_bridge_linux = sub {
 my ($fwbr, $vethfw, $vethfwpeer) = &$compute_fwbr_names($vmid, $devid);
 
 &$cond_create_bridge($fwbr);
+disable_ipv6($fwbr);
 &$activate_interface($fwbr);
 
 copy_bridge_config($bridge, $fwbr);
@@ -292,6 +308,7 @@ my $create_firewall_bridge_ovs = sub {
 my $bridgemtu = &$read_bridge_mtu($bridge);
 
 &$cond_create_bridge($fwbr);
+disable_ipv6($fwbr);
 &$activate_interface($fwbr);
 
 &$bridge_add_interface($fwbr, $iface);
@@ -410,10 +427,13 @@ sub activate_bridge_vlan_slave {

 # create vlan on $iface is not already exist
 if (! -d "/sys/class/net/$ifacevlan") {
-   system("/sbin/ip link add link $iface name ${iface}.${tag} type vlan id 
$tag") == 0 ||
+   system("/sbin/ip link add link $iface name $ifacevlan type vlan id 
$tag") == 0 ||
die "can't add vlan tag $tag to interface $iface\n";
 }
 
+# remove ipv6 link-local address before activation
+disable_ipv6($ifacevlan);
+
 # be sure to have the $ifacevlan up
 &$activate_interface($ifacevlan);
 
@@ -468,9 +488,10 @@ sub activate_bridge_vlan {
 
#fixme: set other bridge flags
 
+   # remove ipv6 link-local address before activation
+   disable_ipv6($bridgevlan);
# be sure to have the bridge up
-   system("/sbin/ip link set $bridgevlan up") == 0 ||
-   die "can't up bridge $bridgevlan\n";
+   &$activate_interface($bridgevlan);
 });
 return $bridgevlan;
 }
-- 
2.1.4
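
For completeness, a quick way to check the effect on a node; 'tap100i0' is just
an example interface name:

# read back the flag that disable_ipv6() writes
my $iface = 'tap100i0';
my $file  = "/proc/sys/net/ipv6/conf/$iface/disable_ipv6";
if (open(my $fh, '<', $file)) {
    chomp(my $val = <$fh>);
    print "$iface: disable_ipv6=$val\n";   # expect 1 after tap_create()/veth_create()
    close($fh);
} else {
    print "no ipv6 sysctl entry for $iface ($!)\n";
}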




Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Michael Rasmussen
On Fri, 30 Sep 2016 00:51:06 +0200
Michael Rasmussen  wrote:

> 
> So my question is: Why use scsi-generic instead of scsi-block when
> scsi-generic prevents blockstats?
> 
Running a fio test also only shows marginal performance difference
between scsi-block and scsi-generic

-device scsi-block
iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, 
iodepth=64
fio-2.1.11
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 3072MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [73928KB/19507KB/0KB /s] [17.3K/4381/0 
iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=1568: Fri Sep 30 01:17:05 2016
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=2454.9MB, bw=87501KB/s, iops=14328, runt= 28728msec
slat (usec): min=2, max=4703, avg=10.47, stdev=16.99
clat (usec): min=315, max=1505.6K, avg=3479.55, stdev=8270.22
 lat (usec): min=321, max=1505.6K, avg=3490.40, stdev=8270.14
clat percentiles (usec):
 |  1.00th=[ 1768],  5.00th=[ 2480], 10.00th=[ 2640], 20.00th=[ 2864],
 | 30.00th=[ 2960], 40.00th=[ 3056], 50.00th=[ 3088], 60.00th=[ 3152],
 | 70.00th=[ 3248], 80.00th=[ 3376], 90.00th=[ 3824], 95.00th=[ 4448],
 | 99.00th=[ 8768], 99.50th=[13120], 99.90th=[52992], 99.95th=[103936],
 | 99.99th=[536576]
bw (KB  /s): min= 7148, max=193016, per=100.00%, avg=87866.39, 
stdev=28395.12
  write: io=631998KB, bw=21999KB/s, iops=3590, runt= 28728msec
slat (usec): min=4, max=9301, avg=12.69, stdev=33.41
clat (usec): min=299, max=778312, avg=3871.08, stdev=7378.66
 lat (usec): min=305, max=778320, avg=3884.17, stdev=7378.66
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[3],
 | 30.00th=[4], 40.00th=[4], 50.00th=[4], 60.00th=[4],
 | 70.00th=[4], 80.00th=[4], 90.00th=[5], 95.00th=[7],
 | 99.00th=[   13], 99.50th=[   19], 99.90th=[   55], 99.95th=[  101],
 | 99.99th=[  537]
bw (KB  /s): min= 1524, max=46713, per=100.00%, avg=22089.18, stdev=7184.64
lat (usec) : 500=0.01%, 750=0.03%, 1000=0.06%
lat (msec) : 2=1.29%, 4=88.78%, 10=8.94%, 20=0.56%, 50=0.22%
lat (msec) : 100=0.05%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%
  cpu  : usr=8.24%, sys=28.49%, ctx=451227, majf=0, minf=8
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=411627/w=103162/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=2454.9MB, aggrb=87501KB/s, minb=87501KB/s, maxb=87501KB/s, 
mint=28728msec, maxt=28728msec
  WRITE: io=631997KB, aggrb=21999KB/s, minb=21999KB/s, maxb=21999KB/s, 
mint=28728msec, maxt=28728msec

Disk stats (read/write):
  sda: ios=407383/102110, merge=123/54, ticks=1413272/456272, in_queue=1869620, 
util=99.71%

-device scsi-generic
iometer: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, 
iodepth=64
fio-2.1.11
Starting 1 process
iometer: Laying out IO file(s) (1 file(s) / 3072MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [64339KB/16908KB/0KB /s] [15.2K/3816/0 
iops] [eta 00m:00s]
iometer: (groupid=0, jobs=1): err= 0: pid=701: Fri Sep 30 01:20:45 2016
  Description  : [Emulation of Intel IOmeter File Server Access Pattern]
  read : io=2454.9MB, bw=88384KB/s, iops=14473, runt= 28441msec
slat (usec): min=5, max=5814, avg=10.86, stdev=21.71
clat (usec): min=459, max=885935, avg=3451.71, stdev=3297.21
 lat (usec): min=526, max=885944, avg=3462.97, stdev=3297.14
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[4],
 | 30.00th=[4], 40.00th=[4], 50.00th=[4], 60.00th=[4],
 | 70.00th=[4], 80.00th=[4], 90.00th=[4], 95.00th=[5],
 | 99.00th=[8], 99.50th=[   11], 99.90th=[   23], 99.95th=[   63],
 | 99.99th=[  153]
bw (KB  /s): min=46295, max=139025, per=100.00%, avg=88833.25, 
stdev=22609.61
  write: io=631998KB, bw=1KB/s, iops=3627, runt= 28441msec
slat (usec): min=6, max=3864, avg=12.96, stdev=24.18
clat (usec): min=582, max=156777, avg=3801.87, stdev=3128.06
 lat (usec): min=610, max=156789, avg=3815.24, stdev=3128.36
clat percentiles (msec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[4],
 | 30.00th=[4], 40.00th=[4], 50.00th=[4], 60.00th=[4],
 | 70.00th=[4], 80.00th=[4], 90.00th=[5], 95.00th=[7],
 | 99.00th=[   11], 99.50th=[   15], 99.90th=[   49], 99.95th=[   74],
 | 99.99th=[  153]
bw (KB  /s): min=11151, max=36378, per=100.00%, avg=22332.46, stdev=5869.71
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.67%, 4=90.61%, 10=8.03%, 

Re: [pve-devel] pve-manager and disk IO monitoring

2016-09-29 Thread Michael Rasmussen
On Thu, 29 Sep 2016 07:38:09 +0200 (CEST)
Alexandre DERUMIER  wrote:

> iostats are coming from qemu.
> 
> what is the output of monitor "info blockstats" for the vm where you don't 
> have stats ?
> 
I have just tested replacing -device scsi-generic with -device scsi-block.
The machine boots and seems to work and, lo and behold, I have disk IO
stats again!
# info blockstats
drive-ide2: rd_bytes=152 wr_bytes=0 rd_operations=4 wr_operations=0 
flush_operations=0 wr_total_time_ns=0 rd_total_time_ns=95326 
flush_total_time_ns=0 rd_merged=0 wr_merged=0 idle_time_ns=258695709709
drive-scsi0: rd_bytes=266729984 wr_bytes=1690120192 rd_operations=15168 
wr_operations=6182 flush_operations=0 wr_total_time_ns=512105318651 
rd_total_time_ns=61640872040 flush_total_time_ns=0 rd_merged=0 wr_merged=0 
idle_time_ns=6506064324

So my question is: Why use scsi-generic instead of scsi-block when
scsi-generic prevents blockstats?

-- 
Hilsen/Regards
Michael Rasmussen


