Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-04-01 Thread Mikhail
On 4/1/20 11:03 AM, Fabian Grünbichler wrote:
> probably makes sense to move this to a bug report over at 
> https://bugzilla.proxmox.com
> 
> please include the following information:
> 
> pveversion -v
> storage config
> VM config
> ceph setup details
> 
> thanks!

Bug report submitted at https://bugzilla.proxmox.com/show_bug.cgi?id=2659

Mikhail.


Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-04-01 Thread Fabian Grünbichler
On April 1, 2020 9:49 am, Mikhail wrote:
> On 4/1/20 10:45 AM, Mikhail wrote:
>> At the time of writing this message my colleague is doing another
>> Disk move on the cluster, and he said he hit the same problem with another
>> VM's disk - 40GB in size - the task is stuck at the very beginning:
>> drive-scsi1: transferred: 427819008 bytes remaining: 70243188736 bytes
>> total: 70671007744 bytes progression: 0.61 % busy: 1 ready: 0
> 
> I just want to add that the issue does not appear to be related to VM disk
> size - right now we have 3 stuck disk moves with different disk sizes:
> 
> 10GB, 20GB and 40GB.
> 
> The most recent one is 10GB disk move:
> 
> drive-scsi0: transferred: 2086797312 bytes remaining: 8572895232 bytes
> total: 10659692544 bytes progression: 19.58 % busy: 1 ready: 0

probably makes sense to move this to a bug report over at 
https://bugzilla.proxmox.com

please include the following information:

pveversion -v
storage config
VM config
ceph setup details

thanks!
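
For reference, a minimal way to collect those details on the affected node
could look like the following; VM 123 is just the example ID used elsewhere
in this thread, and the Ceph commands assume the cluster is managed from the
PVE node itself:

# pveversion -v
# cat /etc/pve/storage.cfg
# qm config 123
# ceph -s
# ceph osd df tree
# ceph versions

Attaching the full output of each command to the bug report should cover the
requested storage, VM and Ceph details.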



Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-04-01 Thread Mikhail
On 4/1/20 10:45 AM, Mikhail wrote:
> At the time of writing this message my colleague is doing another
> Disk move on the cluster, and he said he hit the same problem with another
> VM's disk - 40GB in size - the task is stuck at the very beginning:
> drive-scsi1: transferred: 427819008 bytes remaining: 70243188736 bytes
> total: 70671007744 bytes progression: 0.61 % busy: 1 ready: 0

I just want to add that the issue does not appear to be related to VM disk
size - right now we have 3 stuck disk moves with different disk sizes:

10GB, 20GB and 40GB.

The most recent one is 10GB disk move:

drive-scsi0: transferred: 2086797312 bytes remaining: 8572895232 bytes
total: 10659692544 bytes progression: 19.58 % busy: 1 ready: 0

Mikhail.


Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-04-01 Thread Mikhail
Hello Fabian!

On 4/1/20 9:38 AM, Fabian Grünbichler wrote:
>> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
>> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
>> drive-scsi0: Cancelling block job
> was the target some sort of network storage that started hanging? this 
> looks rather unusual..

We were able to reproduce this issue just now on the same cluster.
The Disk move operation was moving from local "directory" type storage
(VM disks reside as .qcow2 files) to an attached Ceph storage pool.

We made 2 attempts on the same VM with the same disk - the first attempt
failed (the disk transfer got stuck at 10.25 % progress):

deprecated setting 'migration_unsecure' and new 'migration: type' set at
same time! Ignore 'migration_unsecure'
create full clone of drive scsi1
(nvme-local-vm:82082108/vm-82082108-disk-1.qcow2)
drive mirror is starting for drive-scsi1
drive-scsi1: transferred: 737148928 bytes remaining: 20737687552 bytes
total: 21474836480 bytes progression: 3.43 % busy: 1 ready: 0
drive-scsi1: transferred: 1512046592 bytes remaining: 19962789888 bytes
total: 21474836480 bytes progression: 7.04 % busy: 1 ready: 0
drive-scsi1: transferred: 2198994944 bytes remaining: 19260243968 bytes
total: 21459238912 bytes progression: 10.25 % busy: 1 ready: 0
[ here goes 230+ lines of the same 10.25 % progress status ]
drive-scsi1: Cancelling block job

After cancelling this job I looked into the VM's QM monitor to see if the
block-job is still there, and of course it is:

# info block-jobs
Type mirror, device drive-scsi1: Completed 2198994944 of 21459238912
bytes, speed limit 0 bytes/s

Trying to cancel this block-job does nothing, and our next step was to
shut down the VM from the Proxmox GUI - this also fails with the
following in the task log:

TASK ERROR: VM quit/powerdown failed - got timeout

After that, we tried the following from the SSH root console:

# qm stop 82082108
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL

and after that the QM monitor stopped responding from the Proxmox GUI, as
expected:

# info block-jobs
ERROR: VM 82082108 qmp command 'human-monitor-command' failed - unable
to connect to VM 82082108 qmp socket - timeout after 31 retries

So at this point the VM was completely stopped and the disk not moved. The
VM was started again and we did the same steps (Disk move) exactly as
above. We got identical results - the Disk move operation got stuck:

drive-scsi1: transferred: 2187460608 bytes remaining: 19271778304 bytes
total: 21459238912 bytes progression: 10.19 % busy: 1 ready: 0

After cancelling the Disk move operation all of the above symptoms persist -
the VM won't shut down from the GUI, and the block-job is visible from the
QM monitor and won't cancel.
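
If a plain cancel keeps being ignored, one more thing worth trying before a
hard stop is a forced cancel over the raw QMP socket - a minimal sketch,
assuming the usual PVE socket path /var/run/qemu-server/<vmid>.qmp and that
socat is available (82082108 and drive-scsi1 are taken from the log above):

# socat - UNIX-CONNECT:/var/run/qemu-server/82082108.qmp
{ "execute": "qmp_capabilities" }
{ "execute": "block-job-cancel", "arguments": { "device": "drive-scsi1", "force": true } }

If the cancel is rejected, QMP returns an explicit JSON error describing why,
which may help with debugging.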

Our next test case was to do a Disk move offline - with the VM shut down.
And guess what - this worked without a glitch. Same storage, same disk,
but with the VM in the stopped state.

But even after that, with the disk on Ceph and the VM started and
running, we attempted a Disk move from Ceph back to local storage
ONLINE - this also worked like a charm, without any blocks or issues.

The VM disk we were moving back and forth isn't very big - only 20GB.

Note that this issue does not appear to happen with every virtual
machine disk - we moved several disks before we hit it again.
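
Since the stalls only show up when mirroring towards Ceph with the VM
running, it may also be worth ruling out a slow or unresponsive target pool
while a job is stuck - a rough sketch, with 'ceph-vm' standing in for
whatever the actual pool is called (note that rados bench writes temporary
benchmark objects to the pool):

# ceph -s
# ceph health detail
# rbd -p ceph-vm ls -l
# rados -p ceph-vm bench 10 write

If these hang or report slow requests, the problem is more likely on the
Ceph side than in the drive-mirror job itself.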

At the time of writing this message my colleague is doing another
Disk move on the cluster, and he said he hit the same problem with another
VM's disk - 40GB in size - the task is stuck at the very beginning:
drive-scsi1: transferred: 427819008 bytes remaining: 70243188736 bytes
total: 70671007744 bytes progression: 0.61 % busy: 1 ready: 0

Let me know if I can provide further information or do some debugging -
we can reproduce this problem 100% of the time now.

regards,
Mikhail.


Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-04-01 Thread Fabian Grünbichler
On March 31, 2020 5:07 pm, Mikhail wrote:
> On 3/31/20 2:53 PM, Fabian Grünbichler wrote:
>> you should be able to manually clean up the mess using the QMP/monitor
>> interface:
>> 
>> `man qemu-qmp-ref` gives a detailed tour, you probably want
>> `query-block-jobs` and `query-block`, and then, depending on the output
>> `block-job-cancel` or `block-job-complete`.
>> 
> the HMP interface accessible via 'qm monitor <vmid>' has slightly
>> different commands: `info block -v`, `info block-jobs` and 
>> `block_job_cancel`/`block_job_complete` ('_' instead of '-').
> 
> Thanks for your prompt response.
> I've tried the following under the VM's "Monitor" section within the
> Proxmox web GUI:
> 
> # info block-jobs
> Type mirror, device drive-scsi0: Completed 6571425792 of 10725883904
> bytes, speed limit 0 bytes/s
> 
> and after that I tried to cancel this block job using:
> 
> # block_job_cancel -f drive-scsi0
> 
> However, the block job is still there even after 3 attempts to cancel
> it:
> 
> # info block-jobs
> Type mirror, device drive-scsi0: Completed 6571425792 of 10725883904
> bytes, speed limit 0 bytes/s
> 
> The same happens when I connect via the root console using "qm monitor".
> 
> I guess this is now completely stuck and the only way would be to power
> off/on the VM?

well, you could investigate more with the QMP interface (it gives a lot 
more information). but yes, a shutdown/boot cycle should get rid of the 
block-job.

>> feel free to post the output of the query/info commands before deciding 
>> how to proceed. the complete task log of the failed 'move disk' 
>> operation would also be interesting, if it is still available.
> 
> I just asked my colleague who was cancelling this Disk move operation.
> He said he had to cancel it because it was stuck at 61.27%. The Disk
> move task log is below; I truncated the repeating lines:
> 
> deprecated setting 'migration_unsecure' and new 'migration: type' set at
> same time! Ignore 'migration_unsecure'
> create full clone of drive scsi0 (nvme-local-vm:123/vm-123-disk-0.qcow2)
> drive mirror is starting for drive-scsi0
> drive-scsi0: transferred: 24117248 bytes remaining: 10713300992 bytes
> total: 10737418240 bytes progression: 0.22 % busy: 1 ready: 0
> drive-scsi0: transferred: 2452619264 bytes remaining: 6635388928 bytes
> total: 9088008192 bytes progression: 26.99 % busy: 1 ready: 0
> drive-scsi0: transferred: 3203399680 bytes remaining: 6643777536 bytes
> total: 9847177216 bytes progression: 32.53 % busy: 1 ready: 0
> drive-scsi0: transferred: 4001366016 bytes remaining: 6632243200 bytes
> total: 10633609216 bytes progression: 37.63 % busy: 1 ready: 0
> drive-scsi0: transferred: 4881121280 bytes remaining: 5856296960 bytes
> total: 10737418240 bytes progression: 45.46 % busy: 1 ready: 0
> drive-scsi0: transferred: 6554648576 bytes remaining: 4171235328 bytes
> total: 10725883904 bytes progression: 61.11 % busy: 1 ready: 0
> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
> [ same line repeats like 250+ times ]
> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
> drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
> total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
> drive-scsi0: Cancelling block job

was the target some sort of network storage that started hanging? this 
looks rather unusual..



Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-03-31 Thread Mikhail
On 3/31/20 2:53 PM, Fabian Grünbichler wrote:
> you should be able to manually clean up the mess using the QMP/monitor
> interface:
> 
> `man qemu-qmp-ref` gives a detailed tour, you probably want
> `query-block-jobs` and `query-block`, and then, depending on the output
> `block-job-cancel` or `block-job-complete`.
> 
> the HMP interface accessible via 'qm monitor <vmid>' has slightly
> different commands: `info block -v`, `info block-jobs` and 
> `block_job_cancel`/`block_job_complete` ('_' instead of '-').

Thanks for your prompt response.
I've tried the following under the VM's "Monitor" section within the
Proxmox web GUI:

# info block-jobs
Type mirror, device drive-scsi0: Completed 6571425792 of 10725883904
bytes, speed limit 0 bytes/s

and after that I tried to cancel this block job using:

# block_job_cancel -f drive-scsi0

However, the block job is still there even after 3 attempts to cancel
it:

# info block-jobs
Type mirror, device drive-scsi0: Completed 6571425792 of 10725883904
bytes, speed limit 0 bytes/s

The same happens when I connect via the root console using "qm monitor".

I guess this is now completely stuck and the only way would be to power
off/on the VM?

> 
> feel free to post the output of the query/info commands before deciding 
> how to proceed. the complete task log of the failed 'move disk' 
> operation would also be interesting, if it is still available.

I just asked my colleague who was cancelling this Disk move operation.
He said he had to cancel it because it was stuck at 61.27%. The Disk
move task log is below; I truncated the repeating lines:

deprecated setting 'migration_unsecure' and new 'migration: type' set at
same time! Ignore 'migration_unsecure'
create full clone of drive scsi0 (nvme-local-vm:123/vm-123-disk-0.qcow2)
drive mirror is starting for drive-scsi0
drive-scsi0: transferred: 24117248 bytes remaining: 10713300992 bytes
total: 10737418240 bytes progression: 0.22 % busy: 1 ready: 0
drive-scsi0: transferred: 2452619264 bytes remaining: 6635388928 bytes
total: 9088008192 bytes progression: 26.99 % busy: 1 ready: 0
drive-scsi0: transferred: 3203399680 bytes remaining: 6643777536 bytes
total: 9847177216 bytes progression: 32.53 % busy: 1 ready: 0
drive-scsi0: transferred: 4001366016 bytes remaining: 6632243200 bytes
total: 10633609216 bytes progression: 37.63 % busy: 1 ready: 0
drive-scsi0: transferred: 4881121280 bytes remaining: 5856296960 bytes
total: 10737418240 bytes progression: 45.46 % busy: 1 ready: 0
drive-scsi0: transferred: 6554648576 bytes remaining: 4171235328 bytes
total: 10725883904 bytes progression: 61.11 % busy: 1 ready: 0
drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
[ same line repeats like 250+ times ]
drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
drive-scsi0: transferred: 6571425792 bytes remaining: 4154458112 bytes
total: 10725883904 bytes progression: 61.27 % busy: 1 ready: 0
drive-scsi0: Cancelling block job
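
As an aside, the 'migration_unsecure' warning at the top of this log only
means that both the deprecated key and the new-style 'migration:' key are set
in /etc/pve/datacenter.cfg, and is presumably unrelated to the stuck mirror.
Keeping just the new-style line, roughly like this (type and network values
here are placeholders - keep whatever the cluster already uses):

migration: type=secure,network=192.0.2.0/24

and removing the old 'migration_unsecure' line should silence the warning.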

regards,
Mikhail.


Re: [PVE-User] Problem with QEMU drive-mirror after cancelling VM disk move

2020-03-31 Thread Fabian Grünbichler
On March 31, 2020 12:21 pm, Mikhail wrote:
> Hello,
> 
> On one of our clusters we're seeing issues with a VM backup task - the
> backup task fails with the following:
> 
> ERROR: Node 'drive-scsi0' is busy: block device is in use by block job:
> mirror
> INFO: aborting backup job
> ERROR: Backup of VM 123 failed - Node 'drive-scsi0' is busy: block
> device is in use by block job: mirror
> 
> I did some digging, and it appears that a "drive-mirror" job in QEMU/KVM
> for this particular VM is blocking the backup process. It is clear to me
> that this problem started a couple of weeks ago when we attempted to
> change the VM disk's underlying storage; the "Move disk" operation was
> cancelled manually by the administrator at the time, and the backup tasks
> started failing right after that. I'm not sure whether this is a Proxmox
> issue or a QEMU/KVM one, but I suppose that stopping and starting the VM
> from within Proxmox would remove this block. However, in our case keeping
> this virtual machine up and running is critical, and we should avoid even
> 1-2 minutes of downtime.

yes, shutting the VM down and starting it again gets rid of any leftover 
block-jobs for sure.
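
For completeness, a guarded version of that cycle from the CLI could look
like this, with VM 123 as elsewhere in this thread and an arbitrary timeout
in seconds; 'qm stop' is the hard fallback if the guest ignores the ACPI
shutdown:

# qm shutdown 123 --timeout 300
# qm start 123
# qm stop 123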

> The question is how to remove this drive-mirror block online and how to
> avoid this in the future.

you should be able to manually clean up the mess using the QMP/monitor
interface:

`man qemu-qmp-ref` gives a detailed tour, you probably want
`query-block-jobs` and `query-block`, and then, depending on the output
`block-job-cancel` or `block-job-complete`.

the HMP interface accessible via 'qm monitor <vmid>' has slightly
different commands: `info block -v`, `info block-jobs` and 
`block_job_cancel`/`block_job_complete` ('_' instead of '-').
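
On a typical PVE node the QMP socket for a guest lives at
/var/run/qemu-server/<vmid>.qmp, so one possible way to run the commands
above (using socat, which is an assumption here - any QMP-capable client
works) is an interactive session that starts with the capabilities
handshake:

# socat - UNIX-CONNECT:/var/run/qemu-server/123.qmp
{ "execute": "qmp_capabilities" }
{ "execute": "query-block-jobs" }
{ "execute": "query-block" }

block-job-cancel and block-job-complete then take the job's device name as
an argument, e.g.
{ "execute": "block-job-cancel", "arguments": { "device": "drive-scsi0" } }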

feel free to post the output of the query/info commands before deciding 
how to proceed. the complete task log of the failed 'move disk' 
operation would also be interesting, if it is still available.
