applied with fixes - see comment inline: > On April 13, 2018 at 12:24 PM Wolfgang Link <w.l...@proxmox.com> wrote: > > > In certain high-load scenarios ANY ZFS operation can block, > including registering an (async) destroy. > Since ZFS operations are implemented via ioctl's, > killing the user space process > does not affect the waiting kernel thread processing the ioctl. > > Once "zfs destroy" has been called, killing it does not say anything > about whether the destroy operation will be aborted or not. > Since running into a timeout effectively means killing it, > we don't know whether the snapshot exists afterwards or not. > We also don't know how long it takes for ZFS to catch up on pending ioctls. > > Given the above problem, we must to not die on errors when deleting a no > longer needed snapshot fails (under a timeout) after an otherwise > successful replication. Since we retry on the next run anyway, this is > not problematic. > > The snapshot deletion error will be logged in the replication log > and the syslog/journal. > --- > PVE/Replication.pm | 30 ++++++++++++++++++++++++++---- > 1 file changed, 26 insertions(+), 4 deletions(-) > > Patch V2: > As discuss with Dietmar we will keep remove the former snapshot at the end of > the run, > but ignore if it fails. This is to prevent we have 2 replication snapshots at > time. > > Patch V3: > Remove cleanup_local_snapshots from eval as Diemar suggest. > Use Fabian commit message. > > diff --git a/PVE/Replication.pm b/PVE/Replication.pm > index 9bc4e61..98ba1b6 100644 > --- a/PVE/Replication.pm > +++ b/PVE/Replication.pm > @@ -136,8 +136,21 @@ sub prepare { > $last_snapshots->{$volid}->{$snap} = 1; > } elsif ($snap =~ m/^\Q$prefix\E/) { > $logfunc->("delete stale replication snapshot '$snap' on > $volid"); > - PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap); > - $cleaned_replicated_volumes->{$volid} = 1; > + > + eval { > + PVE::Storage::volume_snapshot_delete($storecfg, $volid, > $snap); > + $cleaned_replicated_volumes->{$volid} = 1; > + }; > + > + # If deleting the snapshot fails, we can not be sure if it was > due to an > error or a timeout. > + # The likelihood that the delete has worked out is high at a > timeout. > + # If it really fails, it will try to remove on the next run. > + > + # warn is for syslog/journal. > + warn $@ if $@; > + > + # logfunc will written in replication log. > + $logfunc->("delete stale replication snapshot error: $@") if $@; > } > } > } > @@ -296,9 +309,18 @@ sub replicate { > # remove old snapshots because they are no longer needed > $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname); > > - remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, > $start_time, $logfunc); > + eval { > + remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, > $start_time, $logfunc); > + }; > > - die $err if $err; > + # old snapshots will removed by next run from prepare_local_job. > + if ($err = $@) { > + # warn is for syslog/journal. > + warn $err; > + > + # logfunc will written in replication log. > + $logfunc->("delete stale replication snapshot error: err");
should be: $logfunc->("delete stale replication snapshot error: $err"); > + } > > return $volumes; > } > -- _______________________________________________ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel