Re: [pve-devel] [PATCH manager 1/4] ceph: add perf data cache helpers
On 11/7/19 8:49 AM, Dominik Csapak wrote:
>>> yes librados and ceph config is available, but that does not mean the
>>> cluster is designed so that all nodes can reach the monitor nodes...
>>> e.g.:
>>>
>>> 5 nodes with node0-node2 ceph nodes, node3 a 'compute' node, and
>>> node4 is a node in the same cluster, but shares only the 'pve cluster
>>> network' with the others, not the ceph or vm network.. this node will
>>> never be able to reach the ceph monitors...
>>>
>>
>> You know which nodes hosts ceph. You even can limit this to monitor
>> nodes, and do the same there (lowest or highest node-id sends, if none
>> of those are quorate the monitor probably isn't either, and even if,
>> it simply does not hurt)
>>
>
> this makes sense and is very doable, something like:
>
> if (quorate && monitor_node) {
>     // get list of quorate nodes
>     // get list of monitor nodes
>     // check if i am the lowest/highest quorate monitor node
>     // if yes, collect/broadcast data
> }
>
> the trade-off remains (this could happen to be a pve quorate node
> where the monitor is not quorate), but in a 'normal' cluster
> with 3 monitors the chance would be 1/3 (which is ok) and
> converges against 1/2 for very many monitors (unlikely)

What? You don't have a probability of 0.33 or 0.5 for a fallout...
Systems reliability and statistics do not work like that.

> but yes as you say below, in such a scenario, the admin has
> different problems ;)
>
> * having multiple nodes query it, distributes the load between
>   the nodes, especially when considering my comment on patch 2/4
>   where i suggested that we reduce the amount of updates here and
>   since the pvestatd loops are not synchronized, we get more datapoints
>   with less rados calls per node

makes no sense, you multiply the (cluster traffic) load between nodes,
not reduce it.. All nodes producing cluster traffic for this is NAK'd
by me.
>>>
>>> i am not really up to speed about the network traffic the current
>>> corosync/pmxcfs versions produce, but i would imagine if we have
>>> 1 node syncing m datapoints, it should be roughly the same as
>>> n nodes syncing m/n datapoints ? we could scale that with the number
>>> of nodes for example...
>>
>> what? if each node syncs whatever bytes, all nodes send and receive
>> that many, so you get (n-1) * (n-1) * (payload bytes + overhead),
>> where overhead with crypto and all is not exactly zero. While a
>> single sender means one (n-1) term less, i.e. O(n^2) vs. O(n)..
>> Plus, with the monitor nodes as senders one saves "n - (monitor_count)"
>> status sends too.
>>
>
> i obviously did not convey my idea very well...
> yes, as my patch currently is, you are right that we have
> (n-1)*(n-1)*size network traffic
>
> but what i further tried to propose was a mechanic by which
> each node only sends the data on the nth iteration of the loop
> (where n could be fixed or even dependent on the (quorate) nodecount)
>
> so that each node only sends 1/n datapoints per pvestatd loop (on average)

more complicated and with more drawbacks, I do not see how that helps...
So if a part of the nodes are offline, rebooted, what not, we just get
no statistics for some points?? Also you need to coordinate the time
when this starts between all of them, and adapt pvestatd scheduling to
that.. Really not sure about this.

>
> see my above comment, the update calls are (by chance) not done at the
> same time

becomes obsolete once this is done once per cluster; also, I normally
don't want guaranteed-unpredictable time intervals in this sampling.

>>>
>>> i still see a problem with selecting one node as the source of truth
>>> (for above reasons) and in every scenario, we will have (at least
>>> some times) not even intervals (think pvestatd updates that take
>>> longer, network congestion, nodes leaving/entering the quorate
>>> partition, etc.)
>>
>> you can still have that if you do this per node, so I don't see your
>> point.
>>
>>>
>>> also the intervals are not unpredictable (besides my point above)
>>> they are just not evenly spaced...
>>
>> Didn't you just say "not even intervals"? So how are they not
>> guaranteed unpredictable if, e.g., 16 nodes all send that stuff..
>> That's for sure not equidistant - a single node having control over
>> this is as best it can get.
>>
>
> we mean the same, but express it differently ;)
>
> the timestamps 1,2,9,11,12,19,21,22,29,... are not equidistant (what i
> meant with 'not evenly spaced') but they are also not 'unpredictable'
> (what i understand as 'randomly spaced')

scheduling is unpredictable, parts of it are even taken in as entropy
for the random subsystem, so I'd guess that this holds.

Now, that's for one system; adding clustering + network and doing
something like you proposed will have very strange effects. E.g., one
node hangs a bit, so the stats always miss entries divisible by
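The traffic scaling argued over in this thread can be made concrete with a toy calculation (an illustrative sketch only; `payload` here is an assumed per-interval metric payload size in arbitrary units):

```python
# Toy model of the thread's traffic argument: with every node broadcasting
# its stats, cluster traffic grows as (n-1) * (n-1) * payload; with a single
# sender (e.g. the lowest-id quorate monitor node) it is only (n-1) * payload.

def all_nodes_broadcast(n, payload):
    # each of the n-1 "other" nodes both sends to and receives from the rest
    return (n - 1) * (n - 1) * payload

def single_sender(n, payload):
    # one designated sender delivers the payload to the n-1 other nodes
    return (n - 1) * payload

for n in (4, 8, 16):
    print(n, all_nodes_broadcast(n, 1), single_sender(n, 1))
```

At 16 nodes this is 225 payload units per interval versus 15, which is the O(n^2) vs. O(n) point made above.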
[pve-devel] [PATCH kernel-meta] efiboot/autorm functions: ignore running kernel if it was removed
In the case where someone removes the currently running kernel we can no
longer "keep" it. While this was obviously no issue for the autoremoval
logic, it is an issue for the pve-efiboot-tool refresh command, which
reuses this helper to see which kernels it needs to keep on the ESP.

Without this, a running kernel was never removed from the EFI System
Partitions if de-installed from a host, so if it sorted as the newest
one it was booted again, which naturally confuses users (it was just
removed!!).

So, to ensure that we cannot get such zombie kernels, only include
installed kernels in the list.

Signed-off-by: Thomas Lamprecht
---
still not enough to make it fully work, as a previous fix (initrd entry
removal from /proc/cmdline) doesn't work with `mawk` (the default AWK)
but only with `gawk`, as it uses GNU AWK extensions

 efiboot/functions | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/efiboot/functions b/efiboot/functions
index a179713..b804fb9 100755
--- a/efiboot/functions
+++ b/efiboot/functions
@@ -14,7 +14,7 @@ PMX_LOADER_CONF="loader/loader.conf"
 # debian's apt package:
 #
 # Mark as not-for-autoremoval those kernel packages that are:
-#   - the currently booted version
+#   - the currently booted version, if still installed
 #   - the kernel version we've been called for
 #   - the latest kernel version (as determined by debian version number)
 #   - the second-latest kernel version
@@ -37,6 +37,11 @@ kernel_keep_versions() {
 	# ignore the currently running version if attempting a reproducible build
 	if [ -n "${SOURCE_DATE_EPOCH}" ]; then
 		running_version=""
+	elif [ ! -e "/boot/vmlinuz-$running_version" ]; then
+		# ignore the current version if it got removed, the "auto-remove" logic
+		# will not be affected, because either it is installed and thus we keep
+		# it in the list, or it's already removed anyway
+		running_version=""
 	fi
 	latest_2_versions="$(echo "$sorted_list" | grep -E '^[^ ]+-pve' | head -n2 )"
--
2.20.1

___
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
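The new guard can be exercised outside a real `/boot` with a small sketch (the helper name and the temp dir standing in for `/boot` are hypothetical, but the `[ -e /boot/vmlinuz-$version ]` test mirrors the one the patch adds):

```shell
# Stand-in for /boot with exactly one "installed" kernel image.
boot_dir=$(mktemp -d)
touch "$boot_dir/vmlinuz-5.0.21-5-pve"

# Hypothetical helper: keep only versions whose image still exists on disk,
# i.e. drop de-installed ("zombie") kernels from the keep list.
keep_installed_only() {
    for ver in "$@"; do
        [ -e "$boot_dir/vmlinuz-$ver" ] && echo "$ver"
    done
}

kept=$(keep_installed_only 5.0.21-5-pve 5.0.18-3-pve)
echo "$kept"   # -> 5.0.21-5-pve (the removed 5.0.18-3-pve is dropped)
```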
Re: [pve-devel] [PATCH v3 kernel-meta] fix #2403: exclude initrd entries from /proc/cmdline
On 10/16/19 1:17 PM, Oguz Bektas wrote:
> if we fallback to /proc/cmdline, it can include the booted initrd.
>
> to avoid loader entries with initrd 'options' lines, we have to parse
> them out.
>
> Signed-off-by: Oguz Bektas
> ---
>
> v2->v3:
> * match forward slashes
> * match underscore
> * match zero or more whitespace at the end
>
>  efiboot/zz-pve-efiboot | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/efiboot/zz-pve-efiboot b/efiboot/zz-pve-efiboot
> index 4756555..8771da9 100755
> --- a/efiboot/zz-pve-efiboot
> +++ b/efiboot/zz-pve-efiboot
> @@ -50,7 +50,8 @@ update_esps() {
> 		CMDLINE="$(cat /etc/kernel/cmdline)"
> 	else
> 		warn "No /etc/kernel/cmdline found - falling back to /proc/cmdline"
> -		CMDLINE="$(cat /proc/cmdline)"
> +		# remove initrd entries
> +		CMDLINE="$(awk '{gsub(/\yinitrd=([0-9a-zA-Z\/\\._-])*\s*/,x)}1' /proc/cmdline)"

soo, this does not work at all with the default installed mawk... Only
with GNU awk, which one may quickly get installed on a developer
workstation... :/ So we all did not test clean environments..

it seems mawk has no word boundary regexp[0], so either we depend on
gawk or we find a way where it works with mawk..

[0]: https://mail-index.netbsd.org/tech-userlevel/2012/12/02/msg006954.html

> 	fi
>
> 	loop_esp_list update_esp_func
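One portable way around the missing `\y` word-boundary extension is to do the stripping with `sed -E` and POSIX character classes instead of awk. This is a hedged sketch, not the solution that was eventually applied:

```shell
# Example cmdline as it may appear in /proc/cmdline when booted via
# systemd-boot; single quotes keep the backslashes literal.
cmdline='root=ZFS=rpool/ROOT/pve-1 initrd=\EFI\proxmox\5.0.21-5-pve\initrd.img-5.0.21-5-pve boot=zfs'

# Remove initrd=... tokens with sed -E; the bracket expression covers the
# same character set as the awk version, [[:space:]]* eats trailing blanks.
stripped=$(printf '%s' "$cmdline" | sed -E 's|initrd=[0-9a-zA-Z/\\._-]*[[:space:]]*||g')

echo "$stripped"   # -> root=ZFS=rpool/ROOT/pve-1 boot=zfs
```

`sed -E` with POSIX bracket expressions is available in the default Debian install, so no gawk dependency is needed for this variant.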
[pve-devel] applied: [PATCH manager 1/2] status/graphite: just use setsockopt to set timeouts
after rethinking this it felt weird, sockets already support this
themselves, so I checked out the IO::Socket::Timeout module, and yeah,
it's just an OOP wrapper for this, hiding the "scary" struct pack.
So instead of adding that as a dependency, let's do it ourselves.

Signed-off-by: Thomas Lamprecht
Cc: Martin Verges
---
 PVE/Status/Graphite.pm | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/PVE/Status/Graphite.pm b/PVE/Status/Graphite.pm
index fdd4ff4c..8afadb53 100644
--- a/PVE/Status/Graphite.pm
+++ b/PVE/Status/Graphite.pm
@@ -4,7 +4,7 @@ use strict;
 use warnings;
 
 use IO::Socket::IP;
-use IO::Socket::Timeout;
+use Socket qw(SOL_SOCKET SO_SNDTIMEO SO_RCVTIMEO);
 
 use PVE::Status::Plugin;
 use PVE::JSONSchema;
@@ -105,9 +105,10 @@ sub write_graphite_hash {
     ) || die "couldn't create carbon socket [$host]:$port - $@\n";
 
     if ( $proto eq 'tcp' ) {
-	IO::Socket::Timeout->enable_timeouts_on($carbon_socket);
-	$carbon_socket->read_timeout($timeout);
-	$carbon_socket->write_timeout($timeout);
+	# seconds and µs
+	my $timeout_struct = pack( 'l!l!', $timeout, 0);
+	setsockopt($carbon_socket, SOL_SOCKET, SO_SNDTIMEO, $timeout_struct);
+	setsockopt($carbon_socket, SOL_SOCKET, SO_RCVTIMEO, $timeout_struct);
     }
 
     write_graphite($carbon_socket, $d, $ctime, $path.".$object");
--
2.20.1
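For comparison, the same setsockopt technique in Python (Linux assumed: `struct timeval` is two native longs, seconds and microseconds, matching Perl's `pack('l!l!', $timeout, 0)`):

```python
import socket
import struct

timeout_s = 3

# Pack a struct timeval (seconds, microseconds) and set both the send and
# receive timeout directly on the socket, no wrapper module needed.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
timeval = struct.pack('ll', timeout_s, 0)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)

# Read the option back to verify the kernel accepted it.
raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, struct.calcsize('ll'))
sec, usec = struct.unpack('ll', raw)
print(sec)  # -> 3
```

Python 3 sockets also offer `settimeout()`, but that is implemented via non-blocking I/O and select, not `SO_SNDTIMEO`/`SO_RCVTIMEO`; the point here is the raw struct-pack route the patch takes.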
[pve-devel] applied: [PATCH manager 2/2] status/graphite: refactor write_graphite to send all at once
Instead of doing multiple sends, one for each status metric line,
assemble it all in a string and send it out in a single go. Per
VM/CT/node we had >10 lines to send, so this is quite the reduction.

Note, though, that thanks to Nagle's algorithm this may not have had a
big effect for TCP, as it buffered those small writes anyhow. For UDP
it can reduce the packet count on the line dramatically.

Signed-off-by: Thomas Lamprecht
---
note, this improves things slightly, but our current usage of the
plugin modules in pvestatd is so full of overhead that it does not make
much of a difference. We create a new connection for each
VM/CT/node/storage, for each! And that newly on every update loop
iteration. While UDP doesn't care too much about that, TCP really does,
and it shows. So there'll be some other patches coming to make this a
bit more performant..

 PVE/Status/Graphite.pm | 52 +++++++++++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/PVE/Status/Graphite.pm b/PVE/Status/Graphite.pm
index 8afadb53..06168cea 100644
--- a/PVE/Status/Graphite.pm
+++ b/PVE/Status/Graphite.pm
@@ -55,14 +55,6 @@ sub options {
     };
 }
 
-# we do not want boolean/state information to export to graphite
-my $key_blacklist = {
-    'template' => 1,
-    'pid' => 1,
-    'agent' => 1,
-    'serial' => 1,
-};
-
 # Plugin implementation
 sub update_node_status {
     my ($class, $plugin_config, $node, $data, $ctime) = @_;
@@ -119,25 +111,35 @@ sub write_graphite_hash {
 
 sub write_graphite {
     my ($carbon_socket, $d, $ctime, $path) = @_;
 
-    for my $key (keys %$d) {
-
-	my $value = $d->{$key};
-	my $oldpath = $path;
-	$key =~ s/\./-/g;
-	$path .= ".$key";
-
-	if ( defined $value ) {
-	    if ( ref $value eq 'HASH' ) {
-		write_graphite($carbon_socket, $value, $ctime, $path);
-	    } elsif ($value =~ m/^[+-]?[0-9]*\.?[0-9]+$/ &&
-		!$key_blacklist->{$key}) {
-		$carbon_socket->send( "$path $value $ctime\n" );
-	    } else {
-		# do not send blacklisted or non-numeric values
+    # we do not want boolean/state information to export to graphite
+    my $key_blacklist = {
+	'template' => 1,
+	'pid' => 1,
+	'agent' => 1,
+	'serial' => 1,
+    };
+
+    my $graphite_data = '';
+    my $assemble_graphite_data;
+    $assemble_graphite_data = sub {
+	my ($metric, $path) = @_;
+
+	for my $key (sort keys %$metric) {
+	    my $value = $d->{$key} // next;
+
+	    $key =~ s/\./-/g;
+	    my $metricpath = $path . ".$key";
+
+	    if (ref($value) eq 'HASH') {
+		$assemble_graphite_data->($value, $metricpath);
+	    } elsif ($value =~ m/^[+-]?[0-9]*\.?[0-9]+$/ && !$key_blacklist->{$key}) {
+		$graphite_data .= "$metricpath $value $ctime\n";
 	    }
 	}
-	$path = $oldpath;
-    }
+    };
+    $assemble_graphite_data->($d, $path);
+
+    $carbon_socket->send($graphite_data) if $graphite_data ne '';
 }
 
 PVE::JSONSchema::register_format('graphite-path', \&pve_verify_graphite_path);
--
2.20.1
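The assemble-then-send pattern of this patch, sketched in Python (hypothetical names; the numeric filter and key blacklist mirror the Perl code):

```python
import re

# keys carrying boolean/state information we do not want to export
KEY_BLACKLIST = {'template', 'pid', 'agent', 'serial'}
NUMERIC_RE = re.compile(r'^[+-]?[0-9]*\.?[0-9]+$')

def assemble(metric, path, ctime, out):
    """Flatten a nested stats dict into 'path value timestamp' lines."""
    for key in sorted(metric):
        value = metric[key]
        if value is None:
            continue
        mpath = f"{path}.{key.replace('.', '-')}"
        if isinstance(value, dict):
            assemble(value, mpath, ctime, out)
        elif key not in KEY_BLACKLIST and NUMERIC_RE.match(str(value)):
            out.append(f"{mpath} {value} {ctime}")
    return out

lines = assemble({'cpu': 0.5, 'blockstat': {'rd_bytes': 123}, 'pid': 4711},
                 'pve.qemu.100', 1573000000, [])
payload = '\n'.join(lines) + '\n'
# one send(payload) instead of one send() per metric line
print(payload)
```

The plaintext graphite protocol is line-oriented, so concatenating the lines and doing a single send is protocol-equivalent to sending them one by one; only the syscall and (for UDP) packet count changes.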
[pve-devel] applied: [PATCH docs] Fix: pveceph: spelling in section Trim/Discard
On 11/7/19 1:42 PM, Alwin Antreich wrote: > Signed-off-by: Alwin Antreich > --- > pveceph.adoc | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/pveceph.adoc b/pveceph.adoc > index 99c610a..122f063 100644 > --- a/pveceph.adoc > +++ b/pveceph.adoc > @@ -762,9 +762,9 @@ Trim/Discard > > It is a good measure to run 'fstrim' (discard) regularly on VMs or > containers. > This releases data blocks that the filesystem isn’t using anymore. It reduces > -data usage and the resource load. Most modern operating systems issue such > -discard commands to their disks regurarly. You only need to ensure that the > -Virtual Machines enable the xref:qm_hard_disk_discard[disk discard option]. > +data usage and resource load. Most modern operating systems issue such > discard > +commands to their disks regularly. You only need to ensure that the Virtual > +Machines enable the xref:qm_hard_disk_discard[disk discard option]. > > [[pveceph_scrub]] > Scrub & Deep Scrub > applied, one typo from you and one from me, nicely balanced ;) ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
[pve-devel] [PATCH docs] Fix: pveceph: spelling in section Trim/Discard
Signed-off-by: Alwin Antreich --- pveceph.adoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pveceph.adoc b/pveceph.adoc index 99c610a..122f063 100644 --- a/pveceph.adoc +++ b/pveceph.adoc @@ -762,9 +762,9 @@ Trim/Discard It is a good measure to run 'fstrim' (discard) regularly on VMs or containers. This releases data blocks that the filesystem isn’t using anymore. It reduces -data usage and the resource load. Most modern operating systems issue such -discard commands to their disks regurarly. You only need to ensure that the -Virtual Machines enable the xref:qm_hard_disk_discard[disk discard option]. +data usage and resource load. Most modern operating systems issue such discard +commands to their disks regularly. You only need to ensure that the Virtual +Machines enable the xref:qm_hard_disk_discard[disk discard option]. [[pveceph_scrub]] Scrub & Deep Scrub -- 2.20.1 ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
[pve-devel] applied: [PATCH docs] Fix: pveceph: broken ref anchor pveceph_mgr_create
On 11/7/19 1:35 PM, Alwin Antreich wrote: > Signed-off-by: Alwin Antreich > --- > pveceph.adoc | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > applied, thanks! ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
[pve-devel] [PATCH docs] Fix: pveceph: broken ref anchor pveceph_mgr_create
Signed-off-by: Alwin Antreich --- pveceph.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pveceph.adoc b/pveceph.adoc index ef257ac..99c610a 100644 --- a/pveceph.adoc +++ b/pveceph.adoc @@ -286,7 +286,7 @@ monitor the cluster. Since the Ceph luminous release at least one ceph-mgr footnote:[Ceph Manager https://docs.ceph.com/docs/{ceph_codename}/mgr/] daemon is required. -[i[pveceph_create_mgr]] +[[pveceph_create_mgr]] Create Manager ~~ -- 2.20.1 ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Re: [pve-devel] applied: [PATCH pve-access-control] ldap auth: add sslversion option
>> Would it make sense to switch to that as default? Do you know how
>> well it's supported?

For now, keep TLS 1.2. TLS 1.3 is really new and far from being
implemented everywhere. (I don't think it's implemented in AD 2019.)

----- Original Mail -----
From: "Thomas Lamprecht"
To: "pve-devel", "aderumier"
Sent: Wednesday, November 6, 2019 20:29:30
Subject: applied: [pve-devel] [PATCH pve-access-control] ldap auth: add sslversion option

On 11/4/19 10:18 AM, Alexandre Derumier wrote:
> default to tls1.2
>

While
https://metacpan.org/pod/distribution/perl-ldap/lib/Net/LDAP.pod#sslversion
only lists ['sslv2' | 'sslv3' | 'sslv23' | 'tlsv1' | 'tlsv1_1' | 'tlsv1_2'],
it says that this is just passed to IO::Socket::SSL, and
https://metacpan.org/pod/IO::Socket::SSL#SSL_version also accepts TLS 1.3,
so I added that as a followup..

Would it make sense to switch to that as default? Do you know how well
it's supported?

> Signed-off-by: Alexandre Derumier
> ---
>  PVE/Auth/AD.pm   | 11 +++++++++++
>  PVE/Auth/LDAP.pm |  5 +++++
>  2 files changed, 16 insertions(+)
>
> diff --git a/PVE/Auth/AD.pm b/PVE/Auth/AD.pm
> index b924b02..a877a76 100755
> --- a/PVE/Auth/AD.pm
> +++ b/PVE/Auth/AD.pm
> @@ -33,6 +33,12 @@ sub properties {
> 	    optional => 1,
>
> 	},
> +	sslversion => {
> +	    description => "LDAPS ssl version.",
> +	    type => 'string',
> +	    enum => [qw(tlsv1 tlsv1_1 tlsv1_2)],
> +	    optional => 1,
> +	},
> 	default => {
> 	    description => "Use this as default realm",
> 	    type => 'boolean',
> @@ -69,6 +75,7 @@ sub options {
> 	domain => {},
> 	port => { optional => 1 },
> 	secure => { optional => 1 },
> +	sslversion => { optional => 1 },
> 	default => { optional => 1 },,
> 	comment => { optional => 1 },
> 	tfa => { optional => 1 },
> @@ -108,6 +115,10 @@ my $authenticate_user_ad = sub {
> 	$ad_args{verify} = 'none';
>     }
>
> +    if ($config->{secure}) {
> +	$ad_args{sslversion} = $config->{sslversion} ? $config->{sslversion} : 'tlsv1_2';
> +    }
> +
>     my $ldap = Net::LDAP->new($conn_string, %ad_args) || die "$@\n";
>
>     $username = "$username\@$config->{domain}"
> diff --git a/PVE/Auth/LDAP.pm b/PVE/Auth/LDAP.pm
> index 9f08504..d6c26eb 100755
> --- a/PVE/Auth/LDAP.pm
> +++ b/PVE/Auth/LDAP.pm
> @@ -70,6 +70,7 @@ sub options {
> 	user_attr => {},
> 	port => { optional => 1 },
> 	secure => { optional => 1 },
> +	sslversion => { optional => 1 },
> 	default => { optional => 1 },
> 	comment => { optional => 1 },
> 	tfa => { optional => 1 },
> @@ -109,6 +110,10 @@ my $authenticate_user_ldap = sub {
> 	$ldap_args{verify} = 'none';
>     }
>
> +    if ($config->{secure}) {
> +	$ldap_args{sslversion} = $config->{sslversion} ? $config->{sslversion} : 'tlsv1_2';
> +    }
> +
>     my $ldap = Net::LDAP->new($conn_string, %ldap_args) || die "$@\n";
>
>     if (my $bind_dn = $config->{bind_dn}) {
[pve-devel] [PATCH qemu-server] restore_tar_archive: Add skiplock to destroy_vm
When calling qmrestore a config file is created and locked with a lock property. The following destroy_vm has been impossible as skiplock has not been set. Signed-off-by: Dominic Jäger --- PVE/QemuServer.pm | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/PVE/QemuServer.pm b/PVE/QemuServer.pm index bfe6662..345ffab 100644 --- a/PVE/QemuServer.pm +++ b/PVE/QemuServer.pm @@ -6604,9 +6604,14 @@ sub restore_tar_archive { my $storecfg = PVE::Storage::config(); -# destroy existing data - keep empty config +# Restoring a backup can replace an existing VM. In this case, we need to +# remove existing data such as disks as they would pile up as unused disks +# in the new VM otherwise. +# keep_empty_config=1 to prevent races until overwriting it with the +# restored config. +# skiplock=1 because qmrestore has set the lock itself. my $vmcfgfn = PVE::QemuConfig->config_file($vmid); -destroy_vm($storecfg, $vmid, 1) if -f $vmcfgfn; +destroy_vm($storecfg, $vmid, 1, 1) if -f $vmcfgfn; my $tocmd = "/usr/lib/qemu-server/qmextract"; -- 2.20.1 ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Re: [pve-devel] [PATCH storage] fix #2085: Handle non-default mount point in path() by introducing new mountpoint property
On November 7, 2019 12:52 pm, Fabian Ebner wrote: > On 11/7/19 9:34 AM, Fabian Grünbichler wrote: >> On November 6, 2019 1:46 pm, Fabian Ebner wrote: >>> A new mountpoint property is added to the schema for ZFSPool storages. >>> When needed for the first time, the current mount point is determined and >>> written to the storage config. >>> >>> Signed-off-by: Fabian Ebner >>> --- >>> PVE/Storage/ZFSPoolPlugin.pm | 25 +++-- >>> 1 file changed, 23 insertions(+), 2 deletions(-) >>> >>> diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm >>> index 16fb0d6..44c8123 100644 >>> --- a/PVE/Storage/ZFSPoolPlugin.pm >>> +++ b/PVE/Storage/ZFSPoolPlugin.pm >>> @@ -32,6 +32,10 @@ sub properties { >>> description => "use sparse volumes", >>> type => 'boolean', >>> }, >>> + mountpoint => { >>> + description => "mount point", >>> + type => 'string', format => 'pve-storage-path', >>> + }, >>> }; >>> } >>> >>> @@ -44,6 +48,7 @@ sub options { >>> disable => { optional => 1 }, >>> content => { optional => 1 }, >>> bwlimit => { optional => 1 }, >>> + mountpoint => { optional => 1 }, >>> }; >>> } >>> >>> @@ -148,11 +153,27 @@ sub path { >>> my ($vtype, $name, $vmid) = $class->parse_volname($volname); >>> >>> my $path = ''; >>> +my $mountpoint = $scfg->{mountpoint}; >>> + >>> +# automatically write mountpoint to storage config if it is not present >>> +if (!$mountpoint) { >>> + $mountpoint = $class->zfs_request($scfg, undef, 'get', '-o', 'value', >>> + 'mountpoint', '-H', $scfg->{pool}); >>> + chomp($mountpoint); >>> + >>> + eval { >>> + PVE::Storage::lock_storage_config(sub { >>> + my $cfg = PVE::Storage::config(); >>> + $cfg->{ids}->{$storeid}->{mountpoint} = $mountpoint; >>> + PVE::Storage::write_config($cfg); >>> + }, "update storage config failed"); >>> + }; >>> + warn $@ if $@; >>> +} >>> >>> if ($vtype eq "images") { >>> if ($name =~ m/^subvol-/ || $name =~ m/^basevol-/) { >>> - # fixme: we currently assume standard mount point?! 
>>> - $path = "/$scfg->{pool}/$name"; >>> + $path = "$mountpoint/$name"; >>> } else { >>> $path = "/dev/zvol/$scfg->{pool}/$name"; >>> } >> >> it would be great if this "get mountpoint property of dataset" (or better: >> even "get arbitrary properties of dataset") helper could be factored out. >> we already have two calls to "zfs get -Hp '...' $dataset" that could >> re-use such a helper. >> >> then we could add a single check in activate_storage (maybe just for the >> branch where we actually just imported the pool?) to verify that the >> stored mountpoint value is correct, and if not update it (or warn about >> the problem). that check could have a very low timeout, and errors could >> be ignored since it is just a double-check. >> >> ___ >> pve-devel mailing list >> pve-devel@pve.proxmox.com >> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel >> > > I noticed that the $scfg parameter in zfs_request is never used. Should > I clean that up? Otherwise I'd need to include the $scfg parameter in > the helper function as well, where it also isn't needed. it's used by ZFS over iSCSI, where the plugin is derived from ZFSPoolPlugin, so it's not safe to remove ;) ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
Re: [pve-devel] [PATCH storage] fix #2085: Handle non-default mount point in path() by introducing new mountpoint property
On 11/7/19 9:34 AM, Fabian Grünbichler wrote: On November 6, 2019 1:46 pm, Fabian Ebner wrote: A new mountpoint property is added to the schema for ZFSPool storages. When needed for the first time, the current mount point is determined and written to the storage config. Signed-off-by: Fabian Ebner --- PVE/Storage/ZFSPoolPlugin.pm | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm index 16fb0d6..44c8123 100644 --- a/PVE/Storage/ZFSPoolPlugin.pm +++ b/PVE/Storage/ZFSPoolPlugin.pm @@ -32,6 +32,10 @@ sub properties { description => "use sparse volumes", type => 'boolean', }, + mountpoint => { + description => "mount point", + type => 'string', format => 'pve-storage-path', + }, }; } @@ -44,6 +48,7 @@ sub options { disable => { optional => 1 }, content => { optional => 1 }, bwlimit => { optional => 1 }, + mountpoint => { optional => 1 }, }; } @@ -148,11 +153,27 @@ sub path { my ($vtype, $name, $vmid) = $class->parse_volname($volname); my $path = ''; +my $mountpoint = $scfg->{mountpoint}; + +# automatically write mountpoint to storage config if it is not present +if (!$mountpoint) { + $mountpoint = $class->zfs_request($scfg, undef, 'get', '-o', 'value', + 'mountpoint', '-H', $scfg->{pool}); + chomp($mountpoint); + + eval { + PVE::Storage::lock_storage_config(sub { + my $cfg = PVE::Storage::config(); + $cfg->{ids}->{$storeid}->{mountpoint} = $mountpoint; + PVE::Storage::write_config($cfg); + }, "update storage config failed"); + }; + warn $@ if $@; +} if ($vtype eq "images") { if ($name =~ m/^subvol-/ || $name =~ m/^basevol-/) { - # fixme: we currently assume standard mount point?! - $path = "/$scfg->{pool}/$name"; + $path = "$mountpoint/$name"; } else { $path = "/dev/zvol/$scfg->{pool}/$name"; } it would be great if this "get mountpoint property of dataset" (or better: even "get arbitrary properties of dataset") helper could be factored out. we already have two calls to "zfs get -Hp '...' 
$dataset" that could re-use such a helper. then we could add a single check in activate_storage (maybe just for the branch where we actually just imported the pool?) to verify that the stored mountpoint value is correct, and if not update it (or warn about the problem). that check could have a very low timeout, and errors could be ignored since it is just a double-check. ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel I noticed that the $scfg parameter in zfs_request is never used. Should I clean that up? Otherwise I'd need to include the $scfg parameter in the helper function as well, where it also isn't needed. ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
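The "get arbitrary properties of a dataset" helper suggested in the review above could be built around parsing `zfs get -H -o property,value` output, which is tab-separated, one property per line. A minimal sketch (function and variable names are assumptions, not the eventual implementation):

```python
def parse_zfs_get(output):
    """Parse `zfs get -H -o property,value <props> <dataset>` output into a dict.

    With -H, zfs prints one 'property<TAB>value' pair per line, no header.
    """
    props = {}
    for line in output.strip().splitlines():
        prop, value = line.split('\t', 1)
        props[prop] = value
    return props

# Sample output of: zfs get -H -o property,value mountpoint,recordsize rpool/data
sample = "mountpoint\t/rpool/data\nrecordsize\t131072\n"
props = parse_zfs_get(sample)
print(props['mountpoint'])  # -> /rpool/data
```

Centralizing this would let `path()`, `activate_storage()`, and the existing `zfs get -Hp` call sites share one code path, as the review suggests.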
[pve-devel] applied: [PATCH] fix #2456 setting bind-mount through API/CLI is broken
On 11/7/19 7:51 AM, Wolfgang Link wrote: > Content-type check is only valid for mp from type 'volume'. > The Content-type check is correct for rootfs and mount points. > --- > src/PVE/LXC/Config.pm | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/src/PVE/LXC/Config.pm b/src/PVE/LXC/Config.pm > index 7e51b8d..39de691 100644 > --- a/src/PVE/LXC/Config.pm > +++ b/src/PVE/LXC/Config.pm > @@ -937,7 +937,7 @@ sub update_pct_config { > if ($opt =~ m/^mp(\d+)$/ || $opt eq 'rootfs') { > $class->check_protection($conf, "can't update CT $vmid drive > '$opt'"); > my $mp = $opt eq 'rootfs' ? $class->parse_ct_rootfs($value) : > $class->parse_ct_mountpoint($value); > - $check_content_type->($mp); > + $check_content_type->($mp) if ($mp->{type} eq 'volume'); > } elsif ($opt eq 'hookscript') { > PVE::GuestHelpers::check_hookscript($value); > } elsif ($opt eq 'nameserver') { > applied, thanks! ___ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
[pve-devel] [PATCH v3 docs] Add section for ZFS Special Device
Signed-off-by: Fabian Ebner
---
Changes from v2:
 * Better example of when a special device is useful
 * Don't mention special_small_blocks property in the first section,
   so it is explained right when we use it for the first time
 * Explain possible values for size right away
 * Use a concrete value for size in the examples

 local-zfs.adoc | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/local-zfs.adoc b/local-zfs.adoc
index b4fb7db..f4db54b 100644
--- a/local-zfs.adoc
+++ b/local-zfs.adoc
@@ -431,3 +431,56 @@ See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
 `keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
 change-key` commands and the `Encryption` section from `man zfs` for more
 details and advanced usage.
+
+
+ZFS Special Device
+~~~~~~~~~~~~~~~~~~
+
+Since version 0.8.0 ZFS supports `special` devices. A `special` device in a
+pool is used to store metadata, deduplication tables, and optionally small
+file blocks.
+
+A `special` device can improve the speed of a pool consisting of slow spinning
+hard disks with a lot of changing metadata. For example workloads that involve
+creating or deleting a large number of files will benefit from the presence of
+a `special` device. ZFS datasets can be configured to store whole small files
+on the `special` device which can further improve the performance. Use SSDs for
+the `special` device.
+
+IMPORTANT: The redundancy of the `special` device should match the one of the
+pool, since the `special` device is a point of failure for the whole pool.
+
+WARNING: Adding a `special` device to a pool cannot be undone!
+
+.Create a pool with `special` device and RAID-1:
+
+ zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
+
+.Add a `special` device to an existing pool with RAID-1:
+
+ zpool add <pool> special mirror <device1> <device2>
+
+ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
+`0` to disable storing small file blocks on the `special` device or a power of
+two in the range between `512B` to `128K`. After setting the property new file
+blocks smaller than `size` will be allocated on the `special` device.
+
+IMPORTANT: If the value for `special_small_blocks` is greater than or equal to
+the `recordsize` of the dataset, *all* data will be written to the `special`
+device, so be careful!
+
+Setting the `special_small_blocks` property on a pool will change the default
+value of that property for all child ZFS datasets (for example all containers
+in the pool will opt in for small file blocks).
+
+.Opt in for small file blocks pool-wide:
+
+ zfs set special_small_blocks=4K <pool>
+
+.Opt in for small file blocks for a single dataset:
+
+ zfs set special_small_blocks=4K <pool>/<filesystem>
+
+.Opt out from small file blocks for a single dataset:
+
+ zfs set special_small_blocks=0 <pool>/<filesystem>
--
2.20.1
Re: [pve-devel] [PATCH v2 docs] Add section for ZFS Special Device
Thanks for the suggestions, I'll do a v3. On 11/6/19 8:40 PM, Thomas Lamprecht wrote: On 11/6/19 10:46 AM, Fabian Ebner wrote: Signed-off-by: Fabian Ebner --- Changes from v1: * Reworded the part that describes when a special device is useful * Moved that part to the top, so people know right away whether it's interesting for them * Add note about values for the special_small_blocks parameter * Split a long sentence local-zfs.adoc | 49 + 1 file changed, 49 insertions(+) diff --git a/local-zfs.adoc b/local-zfs.adoc index b4fb7db..ba211ef 100644 --- a/local-zfs.adoc +++ b/local-zfs.adoc @@ -431,3 +431,52 @@ See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and `keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs change-key` commands and the `Encryption` section from `man zfs` for more details and advanced usage. + + +ZFS Special Device +~~ + +Since version 0.8.0 ZFS supports `special` devices. A `special` device in a +pool is used to store metadata, deduplication tables, and optionally small +file blocks. + +A `special` device can improve the speed of a pool consisting of slow spinning +hard disks with a lot of changing metadata. For example if the pool has many +short-lived files. Enabling `special_small_blocks` can further increase the When does it has short-lived files? Can you be a bit more exact? I mean ZFS cannot guess if the files are short-lifed so a user could wonder what is meant at all here.. Maybe "...workloads that involve creating or deleting a large number of files..." instead of talking about "short-lived files". +performance when those files are small. Use SSDs for the `special` device. "when those files are small." sounds really weird, IMO. maybe "can further increase the performance for whole small files" (with reference to below to tell the reader that it's explained better there? Ok, I'll think of something. 
>> +
>> +IMPORTANT: The redundancy of the `special` device should match the one of the
>> +pool, since the `special` device is a point of failure for the whole pool.
>> +
>> +WARNING: Adding a `special` device to a pool cannot be undone!
>> +
>> +.Create a pool with `special` device and RAID-1:
>> +
>> + zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
>> +
>> +.Add a `special` device to an existing pool with RAID-1:
>> +
>> + zpool add <pool> special mirror <device1> <device2>
>> +
>> +For ZFS datasets where the `special_small_blocks` property is set to a non-zero
>> +value, the `special` device is used to store small file blocks up to that size.
>
> up to that size .. in [bytes, kbytes, ...]?
>
>> +Setting the `special_small_blocks` property on the pool will change the default
>> +value of that property for all child ZFS datasets (for example all containers
>> +in the pool will opt in for small file blocks).
>> +
>> +.Opt in for small file blocks pool-wide:
>> +
>> + zfs set special_small_blocks=<size> <pool>
>> +
>> +.Opt in for small file blocks for a single dataset:
>> +
>> + zfs set special_small_blocks=<size> <pool>/<filesystem>
>> +
>> +.Opt out from small file blocks for a single dataset:
>> +
>> + zfs set special_small_blocks=0 <pool>/<filesystem>
>> +
>> +NOTE: The value for `size` can be `0` to disable storing small file blocks on
>> +the special device or a power of two in the range between `512B` to `128K`.
>> +If the value for `size` is greater than or equal to the `recordsize` of the
>> +dataset, all data will be written to the `special` device, so be careful!
>
> I've still no idea how to pass special_small_blocks, written out? I'd
> mention that somewhere, best with a concrete example.

I'll use concrete values in the examples. And it might also be better to
explain the parameter first (move the NOTE up) and then state the examples.
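The constraint stated in the NOTE above (0, or a power of two between 512B and 128K) can be sketched as a small validator. This is an illustration only: `zfs set` performs the real validation, and the function name here is hypothetical.

```shell
#!/bin/sh
# Hypothetical validator mirroring the NOTE above: special_small_blocks
# accepts 0 (disable), or a power of two from 512 bytes up to 128K
# (131072 bytes). 'zfs set' does the real check; this just shows the rule.
valid_small_blocks() {
    v=$1
    [ "$v" -eq 0 ] && return 0
    [ "$v" -lt 512 ] || [ "$v" -gt 131072 ] && return 1
    [ $(( v & (v - 1) )) -eq 0 ]    # power-of-two check
}

valid_small_blocks 4096 && echo "4096 is valid"
valid_small_blocks 1000 || echo "1000 is rejected"
```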
Re: [pve-devel] [PATCH storage] fix #2085: Handle non-default mount point in path() by introducing new mountpoint property
On 11/7/19 9:34 AM, Fabian Grünbichler wrote:
> On November 6, 2019 1:46 pm, Fabian Ebner wrote:
>> A new mountpoint property is added to the schema for ZFSPool storages.
>> When needed for the first time, the current mount point is determined
>> and written to the storage config.
>>
>> Signed-off-by: Fabian Ebner
>> ---
>>  PVE/Storage/ZFSPoolPlugin.pm | 25 +++++++++++++++++++++++--
>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm
>> index 16fb0d6..44c8123 100644
>> --- a/PVE/Storage/ZFSPoolPlugin.pm
>> +++ b/PVE/Storage/ZFSPoolPlugin.pm
>> @@ -32,6 +32,10 @@ sub properties {
>>  	    description => "use sparse volumes",
>>  	    type => 'boolean',
>>  	},
>> +	mountpoint => {
>> +	    description => "mount point",
>> +	    type => 'string', format => 'pve-storage-path',
>> +	},
>>      };
>>  }
>>
>> @@ -44,6 +48,7 @@ sub options {
>>  	disable => { optional => 1 },
>>  	content => { optional => 1 },
>>  	bwlimit => { optional => 1 },
>> +	mountpoint => { optional => 1 },
>>      };
>>  }
>>
>> @@ -148,11 +153,27 @@ sub path {
>>      my ($vtype, $name, $vmid) = $class->parse_volname($volname);
>>
>>      my $path = '';
>> +    my $mountpoint = $scfg->{mountpoint};
>> +
>> +    # automatically write mountpoint to storage config if it is not present
>> +    if (!$mountpoint) {
>> +	$mountpoint = $class->zfs_request($scfg, undef, 'get', '-o', 'value',
>> +					  'mountpoint', '-H', $scfg->{pool});
>> +	chomp($mountpoint);
>> +
>> +	eval {
>> +	    PVE::Storage::lock_storage_config(sub {
>> +		my $cfg = PVE::Storage::config();
>> +		$cfg->{ids}->{$storeid}->{mountpoint} = $mountpoint;
>> +		PVE::Storage::write_config($cfg);
>> +	    }, "update storage config failed");
>> +	};
>> +	warn $@ if $@;
>> +    }
>>
>>      if ($vtype eq "images") {
>>  	if ($name =~ m/^subvol-/ || $name =~ m/^basevol-/) {
>> -	    # fixme: we currently assume standard mount point?!
>> -	    $path = "/$scfg->{pool}/$name";
>> +	    $path = "$mountpoint/$name";
>>  	} else {
>>  	    $path = "/dev/zvol/$scfg->{pool}/$name";
>>  	}
>
> it would be great if this "get mountpoint property of dataset" (or
> better: even "get arbitrary properties of dataset") helper could be
> factored out. we already have two calls to "zfs get -Hp '...'
> $dataset" that could re-use such a helper.
>
> then we could add a single check in activate_storage (maybe just for the
> branch where we actually just imported the pool?) to verify that the
> stored mountpoint value is correct, and if not update it (or warn about
> the problem). that check could have a very low timeout, and errors could
> be ignored since it is just a double-check.

Yes, that does sound like a better place to do it. I'll work on a v3 with
your suggestions, so just ignore the v2.
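The suggested "get arbitrary properties of dataset" helper would be a Perl method wrapping zfs_request; as a rough shell sketch of the parsing it implies (function name hypothetical), `zfs get -Hp` prints one tab-separated name/property/value/source line per requested property, so one call can fetch several properties:

```shell
#!/bin/sh
# Hypothetical sketch of the suggested "get arbitrary properties" helper.
# The real implementation would be a Perl method around zfs_request; this
# only shows the parsing: 'zfs get -Hp' emits one
# name<TAB>property<TAB>value<TAB>source line per requested property.
parse_zfs_get() {
    # read "zfs get -Hp" output on stdin, print "property=value" pairs
    awk -F'\t' '{ print $2 "=" $3 }'
}

# Simulated output of: zfs get -Hp mountpoint,recordsize rpool/data
printf 'rpool/data\tmountpoint\t/rpool/data\tdefault\n' | parse_zfs_get
printf 'rpool/data\trecordsize\t131072\tdefault\n' | parse_zfs_get
```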
Re: [pve-devel] [PATCH storage] fix #2085: Handle non-default mount point in path() by introducing new mountpoint property
On November 6, 2019 1:46 pm, Fabian Ebner wrote:
> A new mountpoint property is added to the schema for ZFSPool storages.
> When needed for the first time, the current mount point is determined
> and written to the storage config.
>
> Signed-off-by: Fabian Ebner
> ---
>  PVE/Storage/ZFSPoolPlugin.pm | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm
> index 16fb0d6..44c8123 100644
> --- a/PVE/Storage/ZFSPoolPlugin.pm
> +++ b/PVE/Storage/ZFSPoolPlugin.pm
> @@ -32,6 +32,10 @@ sub properties {
>  	    description => "use sparse volumes",
>  	    type => 'boolean',
>  	},
> +	mountpoint => {
> +	    description => "mount point",
> +	    type => 'string', format => 'pve-storage-path',
> +	},
>      };
>  }
>
> @@ -44,6 +48,7 @@ sub options {
>  	disable => { optional => 1 },
>  	content => { optional => 1 },
>  	bwlimit => { optional => 1 },
> +	mountpoint => { optional => 1 },
>      };
>  }
>
> @@ -148,11 +153,27 @@ sub path {
>      my ($vtype, $name, $vmid) = $class->parse_volname($volname);
>
>      my $path = '';
> +    my $mountpoint = $scfg->{mountpoint};
> +
> +    # automatically write mountpoint to storage config if it is not present
> +    if (!$mountpoint) {
> +	$mountpoint = $class->zfs_request($scfg, undef, 'get', '-o', 'value',
> +					  'mountpoint', '-H', $scfg->{pool});
> +	chomp($mountpoint);
> +
> +	eval {
> +	    PVE::Storage::lock_storage_config(sub {
> +		my $cfg = PVE::Storage::config();
> +		$cfg->{ids}->{$storeid}->{mountpoint} = $mountpoint;
> +		PVE::Storage::write_config($cfg);
> +	    }, "update storage config failed");
> +	};
> +	warn $@ if $@;
> +    }
>
>      if ($vtype eq "images") {
>  	if ($name =~ m/^subvol-/ || $name =~ m/^basevol-/) {
> -	    # fixme: we currently assume standard mount point?!
> -	    $path = "/$scfg->{pool}/$name";
> +	    $path = "$mountpoint/$name";
>  	} else {
>  	    $path = "/dev/zvol/$scfg->{pool}/$name";
>  	}

it would be great if this "get mountpoint property of dataset" (or
better: even "get arbitrary properties of dataset") helper could be
factored out. we already have two calls to "zfs get -Hp '...'
$dataset" that could re-use such a helper.

then we could add a single check in activate_storage (maybe just for the
branch where we actually just imported the pool?) to verify that the
stored mountpoint value is correct, and if not update it (or warn about
the problem). that check could have a very low timeout, and errors could
be ignored since it is just a double-check.
[pve-devel] [PATCH v2 docs] Add description for mountpoint property
Signed-off-by: Fabian Ebner
---
 pve-storage-zfspool.adoc | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/pve-storage-zfspool.adoc b/pve-storage-zfspool.adoc
index f53a598..cf7cc3f 100644
--- a/pve-storage-zfspool.adoc
+++ b/pve-storage-zfspool.adoc
@@ -32,6 +32,11 @@ sparse::
 
 Use ZFS thin-provisioning. A sparse volume is a volume whose
 reservation is not equal to the volume size.
 
+mountpoint::
+
+This property is added automatically and only needs to be updated
+when the mount point of your ZFS pool/filesystem changes.
+
 .Configuration Example (`/etc/pve/storage.cfg`)
 
 zfspool: vmdata
-- 
2.20.1
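With the new property, a zfspool entry in `/etc/pve/storage.cfg` could look like the following. The pool name and paths are illustrative, not taken from the patch; the `mountpoint` line is what the plugin would write automatically on first use.

```
zfspool: vmdata
        pool tank/vmdata
        content rootdir,images
        mountpoint /tank/vmdata
```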
[pve-devel] [PATCH v2 storage] fix #2085: Handle non-default mount point in path() by introducing new mountpoint property
A new mountpoint property is added to the schema for ZFSPool storages.
When needed for the first time, the current mount point is determined and
written to the storage config.

Signed-off-by: Fabian Ebner
---
Changes from v1:
    * expanded eval around the zfs_request
    * check if the returned value is valid

 PVE/Storage/ZFSPoolPlugin.pm | 29 +++++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm
index 16fb0d6..7a81b17 100644
--- a/PVE/Storage/ZFSPoolPlugin.pm
+++ b/PVE/Storage/ZFSPoolPlugin.pm
@@ -32,6 +32,10 @@ sub properties {
 	    description => "use sparse volumes",
 	    type => 'boolean',
 	},
+	mountpoint => {
+	    description => "mount point",
+	    type => 'string', format => 'pve-storage-path',
+	},
     };
 }
 
@@ -44,6 +48,7 @@ sub options {
 	disable => { optional => 1 },
 	content => { optional => 1 },
 	bwlimit => { optional => 1 },
+	mountpoint => { optional => 1 },
     };
 }
 
@@ -148,11 +153,31 @@ sub path {
     my ($vtype, $name, $vmid) = $class->parse_volname($volname);
 
     my $path = '';
+    my $mountpoint = $scfg->{mountpoint};
+
+    # automatically write mountpoint to storage config if it is not present
+    if (!$mountpoint) {
+	eval {
+	    $mountpoint = $class->zfs_request($scfg, undef, 'get', '-o', 'value',
+					      'mountpoint', '-H', $scfg->{pool});
+	    chomp($mountpoint);
+	    PVE::JSONSchema::check_format(properties()->{mountpoint}->{format}, $mountpoint);
+
+	    PVE::Storage::lock_storage_config(sub {
+		my $cfg = PVE::Storage::config();
+		$cfg->{ids}->{$storeid}->{mountpoint} = $mountpoint;
+		PVE::Storage::write_config($cfg);
+	    }, "update storage config failed");
+	};
+	if (my $err = $@) {
+	    warn "determining mount point failed: " . $err . " - using default mount point instead\n";
+	    $mountpoint = "/$scfg->{pool}";
+	}
+    }
 
     if ($vtype eq "images") {
 	if ($name =~ m/^subvol-/ || $name =~ m/^basevol-/) {
-	    # fixme: we currently assume standard mount point?!
-	    $path = "/$scfg->{pool}/$name";
+	    $path = "$mountpoint/$name";
 	} else {
 	    $path = "/dev/zvol/$scfg->{pool}/$name";
 	}
-- 
2.20.1
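The fallback logic in path() above can be rendered as a rough shell sketch (the function name is hypothetical; the real code is the Perl in the patch): prefer the configured mountpoint, otherwise ask zfs, and fall back to the default `/<pool>` when detection fails or returns something that is not an absolute path.

```shell
#!/bin/sh
# Hypothetical shell rendering of the v2 patch's fallback: use the
# configured mountpoint if present, otherwise query zfs, and fall back
# to the default "/<pool>" if detection fails or returns junk.
detect_mountpoint() {
    pool=$1
    configured=$2
    if [ -n "$configured" ]; then
        echo "$configured"      # stored config value wins
        return
    fi
    mp=$(zfs get -H -o value mountpoint "$pool" 2>/dev/null) || mp=
    case $mp in
        /*) echo "$mp" ;;       # absolute path: looks like a valid mount point
        *)  echo "/$pool" ;;    # default mount point, as before the patch
    esac
}

detect_mountpoint rpool/data /mnt/custom   # configured value wins
detect_mountpoint rpool/data ""            # falls back if zfs is unavailable
```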