Re: [ovirt-users] moving disk failed.. remained locked

2017-02-23 Thread David Teigland
On Thu, Feb 23, 2017 at 08:11:50PM +0200, Nir Soffer wrote:
> > [g.cecchi@ovmsrv05 ~]$ sudo sanlock client renewal -s
> > 922b5269-ab56-4c4d-838f-49d33427e2ab
> > timestamp=1207533 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1207554 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > ...
> > timestamp=1211163 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1211183 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1211204 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> >
> > How should I interpret the output above? What would it look like in case
> > of problems?
> 
> David, can you explain this output?
> 
> read_ms and write_ms look obvious, but next_timeouts and next_errors are
> a mystery to me.

Sorry for copying, but I think I explained it better then than I could now!
(I need to include this somewhere in the man page.)

commit 6313c709722b3ba63234a75d1651a160bf1728ee
Author: David Teigland 
Date:   Wed Mar 9 11:58:21 2016 -0600

sanlock: renewal history

Keep a history of read and write latencies for a lockspace.
The times are measured for io in delta lease renewal
(each delta lease renewal includes one read and one write).

For each successful renewal, a record is saved that includes:
- the timestamp written in the delta lease by the renewal
- the time in milliseconds taken by the delta lease read
- the time in milliseconds taken by the delta lease write

Also counted and recorded are the number of io timeouts and
other io errors that occur between successful renewals.

Two consecutive successful renewals would be recorded as:

timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0

timestamp is the value written into the delta lease during
that renewal.

read_ms/write_ms are the milliseconds taken for the renewal
read/write ios.

next_timeouts are the number of io timeouts that occurred
after the renewal recorded on that line and before the next
successful renewal on the following line.

next_errors are the number of io errors (not timeouts) that
occurred after the renewal recorded on that line and before the
next successful renewal on the following line.

The command 'sanlock client renewal -s lockspace_name' reports
the full history of renewals saved by sanlock, which by default
is 180 records, about 1 hour of history when using a 20 second
renewal interval for a 10 second io timeout.

(A --summary option could be added to calculate and report
averages over a selected period of the history.)
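
To read a long history quickly: a healthy lockspace shows small, steady
read_ms/write_ms values and zeros in next_timeouts/next_errors, while
struggling storage shows large or erratic latencies and non-zero counters
between renewals. Until a --summary option exists, a minimal Python sketch
along these lines can summarize the output (this is not part of sanlock, it
only parses the format shown above, and the lockspace UUID is the example
from this thread; run it with enough privileges to talk to sanlock):

import subprocess

lockspace = "922b5269-ab56-4c4d-838f-49d33427e2ab"  # example from this thread
out = subprocess.check_output(
    ["sanlock", "client", "renewal", "-s", lockspace]).decode()

reads, writes, timeouts, errors = [], [], 0, 0
for line in out.splitlines():
    if "read_ms=" not in line:
        continue
    # each line looks like: timestamp=... read_ms=... write_ms=... next_timeouts=... next_errors=...
    fields = dict(f.split("=") for f in line.split())
    reads.append(int(fields["read_ms"]))
    writes.append(int(fields["write_ms"]))
    timeouts += int(fields["next_timeouts"])
    errors += int(fields["next_errors"])

if not reads:
    raise SystemExit("no renewal history for %s" % lockspace)
print("renewals: %d" % len(reads))
print("read_ms  avg/max: %.1f / %d" % (sum(reads) / float(len(reads)), max(reads)))
print("write_ms avg/max: %.1f / %d" % (sum(writes) / float(len(writes)), max(writes)))
print("io timeouts: %d  io errors: %d" % (timeouts, errors))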


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-23 Thread Nir Soffer
On Wed, Feb 22, 2017 at 12:20 PM, Gianluca Cecchi
 wrote:
> On Wed, Feb 22, 2017 at 10:59 AM, Nir Soffer  wrote:
>>
>>
>>
>> Lesson, use only storage without problems ;-)
>
>
> hopefully... ;-)
>
>>
>> >> Can you share the output of:
>> >>
>> >> sanlock client renewal -s 900b1853-e192-4661-a0f9-7c7c396f6f49
>> >
>> >
>> > No, the storage domain has been removed
>>
>> Next time when you have storage issues, please remember to grab
>> the output of this command.
>>
>> Nir
>
>
>
>
> For example, on a currently active storage domain I get:
>
> [g.cecchi@ovmsrv05 ~]$ sudo sanlock client renewal -s
> 922b5269-ab56-4c4d-838f-49d33427e2ab
> timestamp=1207533 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> timestamp=1207554 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> ...
> timestamp=1211163 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> timestamp=1211183 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> timestamp=1211204 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
>
> How should I interpret the output above? What would it look like in case
> of problems?

David, can you explain this output?

read_ms and write_ms look obvious, but next_timeouts and next_errors are
a mystery to me.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Gianluca Cecchi
On Wed, Feb 22, 2017 at 10:59 AM, Nir Soffer  wrote:

>
>
> Lesson, use only storage without problems ;-)
>

hopefully... ;-)


> >> Can you share the output of:
> >>
> >> sanlock client renewal -s 900b1853-e192-4661-a0f9-7c7c396f6f49
> >
> >
> > No, the storage domain has been removed
>
> Next time when you have storage issues, please remember to grab
> the output of this command.
>
> Nir
>



For example, on a currently active storage domain I get:

[g.cecchi@ovmsrv05 ~]$ sudo sanlock client renewal -s
922b5269-ab56-4c4d-838f-49d33427e2ab
timestamp=1207533 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
timestamp=1207554 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
...
timestamp=1211163 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
timestamp=1211183 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
timestamp=1211204 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0

How should I interpret the output above? What would it look like in case
of problems?
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Nir Soffer
On Wed, Feb 22, 2017 at 11:45 AM, Gianluca Cecchi
 wrote:
> On Wed, Feb 22, 2017 at 9:56 AM, Nir Soffer  wrote:
>>
>>
>>
>> Gianluca, what is domain 900b1853-e192-4661-a0f9-7c7c396f6f49?
>>
>> is this the domain you are migrating to in the same time?
>
>
> That was the id of the storage domain created on the LUN with problems at
> storage array level.

This explains sanlock issues with this domain.

> It only contained one disk of a VM. I was able to previously move other 2
> disks I had on it to another storage domain
>
> The disk was a data disk of a VM; its system disk was on another storage
> domain without problems
>
> The order of my operations yesterday was:
> - try move disk to another storage domain --> failure in auto snapshot
> - try snapshot of VM selecting both disks --> failure

The first step in moving a disk to another domain when the VM is online
is creating a snapshot on the old storage.

Then we start the mirroring process of the active (empty) snapshot
to the destination storage domain.

Then we copy the rest of the chain (readonly) to the destination
storage domain.

Finally we switch the active layer to the snapshot on the destination
storage domain, and delete the old chain on the source domain.
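
For the curious, the mirroring step is done through libvirt (QEMU's
drive-mirror). Roughly, the underlying libvirt primitive looks like the
sketch below; this is not the code vdsm runs, and the VM name, disk target
and destination path are made up for illustration:

import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("dbatest6")  # hypothetical VM name from this thread

# The destination volume is assumed to be already created on the target
# storage domain (hypothetical path).
dest_xml = "<disk type='block'><source dev='/dev/target_sd/new_volume'/></disk>"

# Mirror only the active layer into the pre-created destination volume.
flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
         libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT)
dom.blockCopy("vda", dest_xml, None, flags)

# Wait until source and destination are in sync, then pivot the active
# layer to the destination.
while True:
    info = dom.blockJobInfo("vda", 0)
    if not info:
        break  # job disappeared (failed or was aborted); pivot below will raise
    if info["end"] > 0 and info["cur"] == info["end"]:
        break  # source and destination are in sync
    time.sleep(1)
dom.blockJobAbort("vda", libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)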

If the source storage is broken you have to stop the VM to move the
disk. This can also fail if we cannot read the disk from this storage.

Lesson, use only storage without problems ;-)

> - try snapshot of VM selecting only the system disk (the good one) --> ok
> and also snapshot deletion ok
> - try snapshot of VM selecting only the data disk --> failure
> - hot add disk (in a good storage domain) to the VM --> OK
> - try pvmove at VM OS level from problematic disk to new disk --> failure:
> VM paused at 47% of pvmove and not able to continue
> - power off VM --> OK
> - remove disk from VM and delete --> OK
>
> Only at this point, with storage domain empty, I started to work on storage
> domain itself, putting it to maintenance and removing it without problems;
> and then the related LUN removal at host level with the notes described in
> other thread
>
>>
>>
>> Can you share the output of:
>>
>> sanlock client renewal -s 900b1853-e192-4661-a0f9-7c7c396f6f49
>
>
> No, the storage domain has been removed

Next time when you have storage issues, please remember to grab
the output of this command.

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Gianluca Cecchi
On Wed, Feb 22, 2017 at 9:56 AM, Nir Soffer  wrote:

>
>
> Gianluca, what is domain 900b1853-e192-4661-a0f9-7c7c396f6f49?
>
> is this the domain you are migrating to in the same time?
>

That was the id of the storage domain created on the LUN with problems at
storage array level.
It only contained one disk of a VM. I was able to previously move other 2
disks I had on it to another storage domain

The disk was a data disk of a VM; its system disk was on another storage
domain without problems

The order of my operations yesterday was:
- try move disk to another storage domain --> failure in auto snapshot
- try snapshot of VM selecting both disks --> failure
- try snapshot of VM selecting only the system disk (the good one) --> ok
and also snapshot deletion ok
- try snapshot of VM selecting only the data disk --> failure
- hot add disk (in a good storage domain) to the VM --> OK
- try pvmove at VM OS level from problematic disk to new disk --> failure:
VM paused at 47% of pvmove and not able to continue
- power off VM --> OK
- remove disk from VM and delete --> OK

Only at this point, with storage domain empty, I started to work on storage
domain itself, putting it to maintenance and removing it without problems;
and then the related LUN removal at host level with the notes described in
other thread.


>
> Can you share the output of:
>
> sanlock client renewal -s 900b1853-e192-4661-a0f9-7c7c396f6f49
>

No, the storage domain has been removed

Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Nir Soffer
On Wed, Feb 22, 2017 at 10:32 AM, Nir Soffer  wrote:
> On Wed, Feb 22, 2017 at 10:31 AM, Nir Soffer  wrote:
>> On Mon, Feb 20, 2017 at 4:49 PM, Gianluca Cecchi
>>  wrote:
>>> Hello,
>>> I'm trying to move a disk from one storage domain A to another B in oVirt
>>> 4.1
>>> The corresponding VM is powered on in the mean time
>>>
>>> When executing the action, there was already in place a disk move from
>>> storage domain C to A (this move was for a disk of a powered off VM and then
>>> completed ok)
>>> I got this in events of webadmin gui for the failed move A -> B:
>>>
>>> Feb 20, 2017 2:42:00 PM Failed to complete snapshot 'Auto-generated for Live
>>> Storage Migration' creation for VM 'dbatest6'.
>>> Feb 20, 2017 2:40:51 PM VDSM ovmsrv06 command HSMGetAllTasksStatusesVDS
>>> failed: Error creating a new volume
>>> Feb 20, 2017 2:40:51 PM Snapshot 'Auto-generated for Live Storage Migration'
>>> creation for VM 'dbatest6' was initiated by admin@internal-authz.
>>>
>>>
>>> And in relevant vdsm.log of referred host ovmsrv06
>>>
>>> 2017-02-20 14:41:44,899 ERROR (tasks/8) [storage.Volume] Unexpected error
>>> (volume:1087)
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/volume.py", line 1081, in create
>>> cls.newVolumeLease(metaId, sdUUID, volUUID)
>>>   File "/usr/share/vdsm/storage/volume.py", line 1361, in newVolumeLease
>>> return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>>>   File "/usr/share/vdsm/storage/blockVolume.py", line 310, in newVolumeLease
>>> sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
>>> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
>>> exception')
>>
>> This means that sanlock could not initialize a lease in the new volume 
>> created
>> for the snapshot.

David, looking in the sanlock log, we don't see any error matching this failure,
but the domain 900b1853-e192-4661-a0f9-7c7c396f6f49 has renewal errors.

I guess because sanlock_init_resource is implemented in the library,
not going through the sanlock daemon?

2017-02-20 14:30:09+0100 1050804 [11738]: 900b1853 aio timeout RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 ioto 10 to_count 1
2017-02-20 14:30:09+0100 1050804 [11738]: s3 delta_renew read timeout
10 sec offset 0 /dev/900b1853-e192-4661-a0f9-7c7c396f6f49/ids
2017-02-20 14:30:09+0100 1050804 [11738]: s3 renewal error -202
delta_length 10 last_success 1050773
2017-02-20 14:30:11+0100 1050806 [11738]: 900b1853 aio collect RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 result 1048576:0 match
reap
2017-02-20 14:35:58+0100 1051153 [11738]: 900b1853 aio timeout RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 ioto 10 to_count 2
2017-02-20 14:35:58+0100 1051153 [11738]: s3 delta_renew read timeout
10 sec offset 0 /dev/900b1853-e192-4661-a0f9-7c7c396f6f49/ids
2017-02-20 14:35:58+0100 1051153 [11738]: s3 renewal error -202
delta_length 10 last_success 1051122
2017-02-20 14:36:01+0100 1051156 [11738]: 900b1853 aio collect RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 result 1048576:0 match
reap
2017-02-20 14:44:36+0100 1051671 [11738]: 900b1853 aio timeout RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 ioto 10 to_count 3
2017-02-20 14:44:36+0100 1051671 [11738]: s3 delta_renew read timeout
10 sec offset 0 /dev/900b1853-e192-4661-a0f9-7c7c396f6f49/ids
2017-02-20 14:44:36+0100 1051671 [11738]: s3 renewal error -202
delta_length 10 last_success 1051641
2017-02-20 14:44:37+0100 1051672 [11738]: 900b1853 aio collect RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 result 1048576:0 match
reap
2017-02-20 14:48:02+0100 1051877 [11738]: 900b1853 aio timeout RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 ioto 10 to_count 4
2017-02-20 14:48:02+0100 1051877 [11738]: s3 delta_renew read timeout
10 sec offset 0 /dev/900b1853-e192-4661-a0f9-7c7c396f6f49/ids
2017-02-20 14:48:02+0100 1051877 [11738]: s3 renewal error -202
delta_length 10 last_success 1051846
2017-02-20 14:48:02+0100 1051877 [11738]: 900b1853 aio collect RD
0x7f41d8c0:0x7f41d8d0:0x7f41e2afa000 result 1048576:0 match
reap
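
As a side note, a quick way to spot which lockspaces are suffering renewal
problems is to scan sanlock.log for the "renewal error" lines quoted above; a
minimal sketch, assuming the default /var/log/sanlock.log location and the
log format shown here:

import re
from collections import Counter

errors = Counter()
# matches e.g. "[11738]: s3 renewal error -202 ..."
pattern = re.compile(r"\]: (\S+) renewal error (-?\d+)")
with open("/var/log/sanlock.log") as log:
    for line in log:
        m = pattern.search(line)
        if m:
            errors[(m.group(1), m.group(2))] += 1

for (space, err), count in errors.most_common():
    print("%s: error %s seen %d times" % (space, err, count))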

Gianluca, what is domain 900b1853-e192-4661-a0f9-7c7c396f6f49?

is this the domain you are migrating to in the same time?

Can you share the output of:

sanlock client renewal -s 900b1853-e192-4661-a0f9-7c7c396f6f49

>>> 2017-02-20 14:41:44,900 ERROR (tasks/8) [storage.TaskManager.Task]
>>> (Task='d694b892-b078-4d86-a035-427ee4fb3b13') Unexpected error (task:870)
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/task.py", line 877, in _run
>>> return fn(*args, **kargs)
>>>   File "/usr/share/vdsm/storage/task.py", line 333, in run
>>> return self.cmd(*self.argslist, **self.argsdict)
>>>   File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line
>>> 79, in wrapper
>>> return method(self, *args, **kwargs)
>>>   File "/usr/share/vdsm/storage/sp.py", line 1929, in createVolume
>>> initialSize=initialSize)
>>>   File "/usr/share/vdsm/storage/sd.py", line 762, in creat

Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Gianluca Cecchi
On Wed, Feb 22, 2017 at 9:32 AM, Nir Soffer  wrote:

> On Wed, Feb 22, 2017 at 10:31 AM, Nir Soffer  wrote:
>
> >
> > This means that sanlock could not initialize a lease in the new volume
> created
> > for the snapshot.
> >
> > Can you attach sanlock.log?
>
> Found it in your next message
>
>
OK.
Just to recap what happened from a physical point of view:

- apparently I had an array of disks with no more spare disks, and on this
array was the LUN backing the storage domain.
So I was involved in moving the disks off the impacted storage domain and
then removing the storage domain itself, so that we could remove the logical
array on the storage.
This is a test storage system without support, so at the moment I had no
more spare disks for it

- actually there was another disk failure in the array, causing loss
of data because no spare was available at that time

- No evidence of errors at the VM OS level or at the storage domain level

- But probably the 2 operations:
1) move disk
2) create a snapshot of the VM containing the disk
could not complete due to this low-level problem

It would be nice to find evidence of this. The storage domain didn't go
offline, BTW

- I got confirmation of the loss of data this way:
The original disk of the VM was, inside the VM, a PV of a VG.
I added a disk (on another storage domain) to the VM, made it a PV and
added it to the original VG.
I tried pvmove from the source disk to the new disk, but it reached about 47%
and then stopped/failed, pausing the VM.
I could start the VM again, but as soon as the pvmove continued, the VM came
back to the paused state.
So I powered off the VM and was able to detach/delete the corrupted disk
and then remove the storage domain (see the other thread opened yesterday)

I then managed to recover the now-corrupted VG and restore from backup the
data contained in the original fs.

So the original problem was a low-level storage error.
If it can be of help to narrow down oVirt behavior in this scenario, I can
provide further logs from the VM OS or from hosts/engine.
Let me know.

Some questions:
- how is the reaction of putting the VM in paused mode due to I/O errors,
as in this case, managed? Can I in some way keep the VM running and let it
see the I/O errors as a real physical server would, or not?
- Why didn't I get any message at the storage domain level but only at the
VM disk level?

Thanks for the given help
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Nir Soffer
On Wed, Feb 22, 2017 at 10:31 AM, Nir Soffer  wrote:
> On Mon, Feb 20, 2017 at 4:49 PM, Gianluca Cecchi
>  wrote:
>> Hello,
>> I'm trying to move a disk from one storage domain A to another B in oVirt
>> 4.1
>> The corresponding VM is powered on in the mean time
>>
>> When executing the action, there was already in place a disk move from
>> storage domain C to A (this move was for a disk of a powered off VM and then
>> completed ok)
>> I got this in events of webadmin gui for the failed move A -> B:
>>
>> Feb 20, 2017 2:42:00 PM Failed to complete snapshot 'Auto-generated for Live
>> Storage Migration' creation for VM 'dbatest6'.
>> Feb 20, 2017 2:40:51 PM VDSM ovmsrv06 command HSMGetAllTasksStatusesVDS
>> failed: Error creating a new volume
>> Feb 20, 2017 2:40:51 PM Snapshot 'Auto-generated for Live Storage Migration'
>> creation for VM 'dbatest6' was initiated by admin@internal-authz.
>>
>>
>> And in relevant vdsm.log of referred host ovmsrv06
>>
>> 2017-02-20 14:41:44,899 ERROR (tasks/8) [storage.Volume] Unexpected error
>> (volume:1087)
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/storage/volume.py", line 1081, in create
>> cls.newVolumeLease(metaId, sdUUID, volUUID)
>>   File "/usr/share/vdsm/storage/volume.py", line 1361, in newVolumeLease
>> return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>>   File "/usr/share/vdsm/storage/blockVolume.py", line 310, in newVolumeLease
>> sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
>> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
>> exception')
>
> This means that sanlock could not initialize a lease in the new volume created
> for the snapshot.
>
> Can you attach sanlock.log?

Found it in your next message

>
>> 2017-02-20 14:41:44,900 ERROR (tasks/8) [storage.TaskManager.Task]
>> (Task='d694b892-b078-4d86-a035-427ee4fb3b13') Unexpected error (task:870)
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/storage/task.py", line 877, in _run
>> return fn(*args, **kargs)
>>   File "/usr/share/vdsm/storage/task.py", line 333, in run
>> return self.cmd(*self.argslist, **self.argsdict)
>>   File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line
>> 79, in wrapper
>> return method(self, *args, **kwargs)
>>   File "/usr/share/vdsm/storage/sp.py", line 1929, in createVolume
>> initialSize=initialSize)
>>   File "/usr/share/vdsm/storage/sd.py", line 762, in createVolume
>> initialSize=initialSize)
>>   File "/usr/share/vdsm/storage/volume.py", line 1089, in create
>> (volUUID, e))
>> VolumeCreationError: Error creating a new volume: (u"Volume creation
>> d0d938bd-1479-49cb-93fb-85b6a32d6cb4 failed: (-202, 'Sanlock resource init
>> failure', 'Sanlock exception')",)
>> 2017-02-20 14:41:44,941 INFO  (tasks/8) [storage.Volume] Metadata rollback
>> for sdUUID=900b1853-e192-4661-a0f9-7c7c396f6f49 offs=8 (blockVolume:448)
>>
>>
>> Was the error generated due to the other migration still in progress?
>> Is there a limit of concurrent migrations from/to a particular storage
>> domain?
>
> No, maybe your network was overloaded by the concurrent migrations?
>
>>
>> Now I would like to retry, but I see that the disk is in state locked with
>> hourglass.
>> The autogenerated snapshot of the failed action was apparently removed with
>> success as I don't see it.
>>
>> How can I proceed to move the disk?
>>
>> Thanks in advance,
>> Gianluca
>>
>> ___
>> Users mailing list
>> Users@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-22 Thread Nir Soffer
On Mon, Feb 20, 2017 at 4:49 PM, Gianluca Cecchi
 wrote:
> Hello,
> I'm trying to move a disk from one storage domain A to another B in oVirt
> 4.1
> The corresponding VM is powered on in the mean time
>
> When executing the action, there was already in place a disk move from
> storage domain C to A (this move was for a disk of a powered off VM and then
> completed ok)
> I got this in events of webadmin gui for the failed move A -> B:
>
> Feb 20, 2017 2:42:00 PM Failed to complete snapshot 'Auto-generated for Live
> Storage Migration' creation for VM 'dbatest6'.
> Feb 20, 2017 2:40:51 PM VDSM ovmsrv06 command HSMGetAllTasksStatusesVDS
> failed: Error creating a new volume
> Feb 20, 2017 2:40:51 PM Snapshot 'Auto-generated for Live Storage Migration'
> creation for VM 'dbatest6' was initiated by admin@internal-authz.
>
>
> And in relevant vdsm.log of referred host ovmsrv06
>
> 2017-02-20 14:41:44,899 ERROR (tasks/8) [storage.Volume] Unexpected error
> (volume:1087)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/volume.py", line 1081, in create
> cls.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/volume.py", line 1361, in newVolumeLease
> return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/blockVolume.py", line 310, in newVolumeLease
> sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
> exception')

This means that sanlock could not initialize a lease in the new volume created
for the snapshot.

Can you attach sanlock.log?

> 2017-02-20 14:41:44,900 ERROR (tasks/8) [storage.TaskManager.Task]
> (Task='d694b892-b078-4d86-a035-427ee4fb3b13') Unexpected error (task:870)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 877, in _run
> return fn(*args, **kargs)
>   File "/usr/share/vdsm/storage/task.py", line 333, in run
> return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line
> 79, in wrapper
> return method(self, *args, **kwargs)
>   File "/usr/share/vdsm/storage/sp.py", line 1929, in createVolume
> initialSize=initialSize)
>   File "/usr/share/vdsm/storage/sd.py", line 762, in createVolume
> initialSize=initialSize)
>   File "/usr/share/vdsm/storage/volume.py", line 1089, in create
> (volUUID, e))
> VolumeCreationError: Error creating a new volume: (u"Volume creation
> d0d938bd-1479-49cb-93fb-85b6a32d6cb4 failed: (-202, 'Sanlock resource init
> failure', 'Sanlock exception')",)
> 2017-02-20 14:41:44,941 INFO  (tasks/8) [storage.Volume] Metadata rollback
> for sdUUID=900b1853-e192-4661-a0f9-7c7c396f6f49 offs=8 (blockVolume:448)
>
>
> Was the error generated due to the other migration still in progress?
> Is there a limit of concurrent migrations from/to a particular storage
> domain?

No, maybe your network was overloaded by the concurrent migrations?

>
> Now I would like to retry, but I see that the disk is in state locked with
> hourglass.
> The autogenerated snapshot of the failed action was apparently removed with
> success as I don't see it.
>
> How can I proceed to move the disk?
>
> Thanks in advance,
> Gianluca
>
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Fred Rolland
I opened a bug for the task cleaner issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1425705

Did you manage to copy the disk?
For better tracking, can you open a bug with the details and logs ?

Thanks,
Fred

On Tue, Feb 21, 2017 at 3:36 PM, Gianluca Cecchi 
wrote:

> The problem itself seems related to the snapshot and to the disk (430 GB
> in size).
>
> Failed to complete snapshot 'test3' creation for VM 'dbatest6'.
> VDSM ovmsrv07 command HSMGetAllTasksStatusesVDS failed: Could not acquire
> resource. Probably resource factory threw an exception.: ()
> Snapshot 'test3' creation for VM 'dbatest6' was initiated by
> admin@internal-authz.
>
> The VM is composed of 2 disks, which are on 2 different storage domains.
> I'm able to create and then delete a snapshot that includes only the first
> system disk (no memory saved), but I receive the same error as in the disk
> move if I try to do a snapshot that instead includes only the second disk
> (again no memory saved).
> In this case the disk doesn't remain locked, as happened when trying to
> move the disk...
> Would it help in any way to shut down the VM?
>
> I should free this storage domain, and this is the only disk remaining
> before decommissioning...
> Thanks,
> Gianluca
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Gianluca Cecchi
The problem itself seems related to the snapshot and to the disk (430 GB
in size).

Failed to complete snapshot 'test3' creation for VM 'dbatest6'.
VDSM ovmsrv07 command HSMGetAllTasksStatusesVDS failed: Could not acquire
resource. Probably resource factory threw an exception.: ()
Snapshot 'test3' creation for VM 'dbatest6' was initiated by
admin@internal-authz.

The VM is composed of 2 disks, which are on 2 different storage domains.
I'm able to create and then delete a snapshot that includes only the first
system disk (no memory saved), but I receive the same error as in the disk
move if I try to do a snapshot that instead includes only the second disk
(again no memory saved).
In this case the disk doesn't remain locked, as happened when trying to
move the disk...
Would it help in any way to shut down the VM?

I should free this storage domain, and this is the only disk remaining before
decommissioning...
Thanks,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Gianluca Cecchi
On Tue, Feb 21, 2017 at 11:47 AM, Fred Rolland  wrote:

> Add before the command (with your db password): PGPASSWORD=engine
>
> for example:
> PGPASSWORD=engine /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -T
>
> PGPASSWORD=engine /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh
> -t disk -u engine -q
>
>
From taskcleaner, if I use the "-T" option I get an error:

[root@ovmgr1 ovirt-engine]# PGPASSWORD=my_pwd
/usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -d engine -u engine -T
 t
ERROR:  column "job_id" does not exist
LINE 1: ...created_at,status,return_value,return_value_class,job_id,ste...
 ^
FATAL: Cannot execute sql command: --command=SELECT
command_id,command_type,root_command_id,command_parameters,command_params_class,created_at,status,return_value,return_value_class,job_id,step_id,executed
FROM GetAllCommandsWithRunningTasks();

I see the function GetAllCommandsWithRunningTasks defined only
in /usr/share/ovirt-engine/setup/dbutils/taskcleaner_sp_3_5.sql,
and it seems it queries command_entities, but if I go directly
into the db, the table indeed doesn't contain a job_id column

I'm on 4.1 upgraded from 4.0.6

engine=# \d command_entities
  Table "public.command_entities"
Column |   Type   |Modifiers

---+--+-
 command_id| uuid | not null
 command_type  | integer  | not null
 root_command_id   | uuid |
 command_parameters| text |
 command_params_class  | character varying(256)   |
 created_at| timestamp with time zone |
 status| character varying(20)| default NULL::character
varying
 callback_enabled  | boolean  | default false
 callback_notified | boolean  | default false
 return_value  | text |
 return_value_class| character varying(256)   |
 executed  | boolean  | default false
 user_id   | uuid |
 parent_command_id | uuid |
 data  | text |
 engine_session_seq_id | bigint   |
 command_context   | text |
Indexes:
"pk_command_entities" PRIMARY KEY, btree (command_id)
"idx_root_command_id" btree (root_command_id) WHERE root_command_id IS
NOT NULL
Referenced by:
TABLE "command_assoc_entities" CONSTRAINT
"fk_coco_command_assoc_entity" FOREIGN KEY (command_id) REFERENCES comm
and_entities(command_id) ON DELETE CASCADE

engine=#

Anyway, after unlocking the disk and retrying the move, I get the same error
while creating the auto snapshot... the first problem on the host (which is a
different host from the one chosen yesterday) seems to be

MetaDataKeyNotFoundError: Meta Data key not found error: ("Missing metadata
key: 'DOMAIN': found: {'NONE':


2017-02-21 11:38:58,985 INFO  (jsonrpc/0) [dispatcher] Run and protect:
createVolume(sdUUID=u'900b1853-e192-4661-a0f9-7c7c396f6f49',
spUUID=u'588237b8-0031-02f6-035d-0136',
imgUUID=u'f0b5a0e4-ee5d-44a7-ba07-08285791368a', size=u'461708984320',
volFormat=4, preallocate=2, diskType=2,
volUUID=u'c39c3d9f-dde8-45ab-b4a9-7c3b45c6391d', desc=u'',
srcImgUUID=u'f0b5a0e4-ee5d-44a7-ba07-08285791368a',
srcVolUUID=u'7ed43974-1039-4a68-a8b3-321e7594fe4c', initialSize=None)
(logUtils:49)
2017-02-21 11:38:58,987 INFO  (jsonrpc/0) [IOProcessClient] Starting client
ioprocess-6269 (__init__:330)
2017-02-21 11:38:59,006 INFO  (ioprocess/32170) [IOProcess] Starting
ioprocess (__init__:452)
2017-02-21 11:38:59,040 INFO  (jsonrpc/0) [dispatcher] Run and protect:
createVolume, Return response: None (logUtils:52)
2017-02-21 11:38:59,053 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call
Volume.create succeeded in 0.07 seconds (__init__:515)
2017-02-21 11:38:59,054 INFO  (tasks/9) [storage.ThreadPool.WorkerThread]
START task 08d7797a-af46-489f-ada0-c70bf4359366 (cmd=>, args=None)
(threadPool:208)
2017-02-21 11:38:59,150 WARN  (tasks/9) [storage.ResourceManager] Resource
factory failed to create resource
'01_img_900b1853-e192-4661-a0f9-7c7c396f6f49.f0b5a0e4-ee5d-44a7-ba07-08285791368a'.
Canceling request. (resourceManager:542)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/resourceManager.py", line 538, in
registerResource
obj = namespaceObj.factory.createResource(name, lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 190, in
createResource
lockType)
  File "/usr/share/vdsm/storage/resourceFactories.py", line 119, in
__getResourceCandidatesList
imgUUID=resourceName)
  File "/usr/share/vdsm/storage/image.py", line 220, in getChain
if srcVol.isLeaf():
  File "/usr/share/vdsm/storage/volume.py", line 1261, in isLeaf
return self._manifest.isLeaf()

Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Fred Rolland
Add before the command (with your db password): PGPASSWORD=engine

for example:
PGPASSWORD=engine /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -T

PGPASSWORD=engine /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t
disk -u engine -q
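
A possible alternative to putting PGPASSWORD on the command line (where it is
visible in the process list): psql also reads ~/.pgpass (mode 0600), so a line
like the following should let these psql-based dbutils scripts connect without
an explicit password (host, port, database and user here match the values used
above; replace yourpassword accordingly):

localhost:5432:engine:engine:yourpassword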

On Tue, Feb 21, 2017 at 12:23 PM, Gianluca Cecchi  wrote:

>
> I see here utilitues:
> https://www.ovirt.org/develop/developer-guide/db-issues/helperutilities/
>
> In particular unlock_entity.sh that should be of help in my case, as I see
> here:
> http://lists.ovirt.org/pipermail/users/2015-April/032576.html
>
> New path in 4.1 is now
> /usr/share/ovirt-engine/setup/dbutils/
> and not
> /usr/share/ovirt-engine/dbscripts
>
> Question:
> How can I verify that "no jobs are still running over it"?
> Is taskcleaner.sh the utility to crosscheck jobs?
>
> In this case how do I provide a password for it?
>
> [root@ovmgr1 ~]# /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -d
> engine -u engine
> psql: fe_sendauth: no password supplied
> FATAL: Cannot execute sql command: --command=select exists (select * from
> information_schema.tables where table_schema = 'public' and table_name =
> 'command_entities');
> psql: fe_sendauth: no password supplied
> FATAL: Cannot execute sql command: --file=/usr/share/ovirt-
> engine/setup/dbutils/taskcleaner_sp.sql
>
> [root@ovmgr1 ~]# /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -h
> Usage: /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh [options]
>
> -h            - This help text.
> -v            - Turn on verbosity (WARNING: lots of output)
> -l LOGFILE    - The logfile for capturing output  (def. )
> -s HOST       - The database servername for the database  (def. localhost)
> -p PORT       - The database port for the database  (def. 5432)
> -u USER       - The username for the database  (def. )
> -d DATABASE   - The database name  (def. )
> -t TASK_ID    - Removes a task by its Task ID.
> -c COMMAND_ID - Removes all tasks related to the given Command Id.
> -T            - Removes/Displays all commands that have running tasks
> -o            - Removes/Displays all commands.
> -z            - Removes/Displays a Zombie task.
> -R            - Removes all tasks (use with -z to clear only zombie tasks).
> -r            - Removes all commands (use with -T to clear only those with running tasks. Use with -Z to clear only commands with zombie tasks.)
> -Z            - Removes/Displays a command with zombie tasks.
> -C            - Clear related compensation entries.
> -J            - Clear related Job Steps.
> -A            - Clear all Job Steps and compensation entries.
> -q            - Quite mode, do not prompt for confirmation.
>
> Thanks,
> Gianluca
>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Gianluca Cecchi
I see here utilitues:
https://www.ovirt.org/develop/developer-guide/db-issues/helperutilities/

In particular unlock_entity.sh that should be of help in my case, as I see
here:
http://lists.ovirt.org/pipermail/users/2015-April/032576.html

New path in 4.1 is now
/usr/share/ovirt-engine/setup/dbutils/
and not
/usr/share/ovirt-engine/dbscripts

Question:
How can I verify that "no jobs are still running over it"?
Is taskcleaner.sh the utility to crosscheck jobs?

In this case how do I provide a password for it?

[root@ovmgr1 ~]# /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -d
engine -u engine
psql: fe_sendauth: no password supplied
FATAL: Cannot execute sql command: --command=select exists (select * from
information_schema.tables where table_schema = 'public' and table_name =
'command_entities');
psql: fe_sendauth: no password supplied
FATAL: Cannot execute sql command:
--file=/usr/share/ovirt-engine/setup/dbutils/taskcleaner_sp.sql

[root@ovmgr1 ~]# /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -h
Usage: /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh [options]

-h            - This help text.
-v            - Turn on verbosity (WARNING: lots of output)
-l LOGFILE    - The logfile for capturing output  (def. )
-s HOST       - The database servername for the database  (def. localhost)
-p PORT       - The database port for the database  (def. 5432)
-u USER       - The username for the database  (def. )
-d DATABASE   - The database name  (def. )
-t TASK_ID    - Removes a task by its Task ID.
-c COMMAND_ID - Removes all tasks related to the given Command Id.
-T            - Removes/Displays all commands that have running tasks
-o            - Removes/Displays all commands.
-z            - Removes/Displays a Zombie task.
-R            - Removes all tasks (use with -z to clear only zombie tasks).
-r            - Removes all commands (use with -T to clear only those with running tasks. Use with -Z to clear only commands with zombie tasks.)
-Z            - Removes/Displays a command with zombie tasks.
-C            - Clear related compensation entries.
-J            - Clear related Job Steps.
-A            - Clear all Job Steps and compensation entries.
-q            - Quite mode, do not prompt for confirmation.

Thanks,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-21 Thread Gianluca Cecchi
On Tue, Feb 21, 2017 at 7:01 AM, Gianluca Cecchi 
wrote:

> On Mon, Feb 20, 2017 at 10:51 PM, Gianluca Cecchi <
> gianluca.cec...@gmail.com> wrote:
>
>> On Mon, Feb 20, 2017 at 8:46 PM, Fred Rolland 
>> wrote:
>>
>>> Can you please send the whole logs ? (Engine, vdsm and sanlock)
>>>
>>>
>> vdsm.log.1.xz:
>> https://drive.google.com/file/d/0BwoPbcrMv8mvWTViWEUtNjRtLTg
>> /view?usp=sharing
>>
>> sanlock.log
>> https://drive.google.com/file/d/0BwoPbcrMv8mvcVM4YzZ4aUZLYVU
>> /view?usp=sharing
>>
>> engine.log (gzip format);
>> https://drive.google.com/file/d/0BwoPbcrMv8mvdW80RlFIYkpzenc
>> /view?usp=sharing
>>
>> Thanks,
>> Gianluca
>>
>>
> I didn't mention that the size of the disk is 430 GB and the target storage
> domain is 1 TB, almost empty (950 GB free).
> I received a message about problems from the storage where the disk is,
> and so I'm trying to move it so that I can put the original one under
> maintenance and investigate.
> The errors seem to be about creating the volume on the destination, not the
> source...
> thanks,
> Gianluca
>
>

Info on disk:

[g.cecchi@ovmsrv07 ~]$ sudo qemu-img info
/rhev/data-center/588237b8-0031-02f6-035d-0136/900b1853-e192-4661-a0f9-7c7c396f6f49/images/f0b5a0e4-ee5d-44a7-ba07-08285791368a/7ed43974-1039-4a68-a8b3-321e7594fe4c
image:
/rhev/data-center/588237b8-0031-02f6-035d-0136/900b1853-e192-4661-a0f9-7c7c396f6f49/images/f0b5a0e4-ee5d-44a7-ba07-08285791368a/7ed43974-1039-4a68-a8b3-321e7594fe4c
file format: qcow2
virtual size: 430G (461708984320 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
[g.cecchi@ovmsrv07 ~]$

Based on another command I learnt from another thread, this is what I get
if I check the disk:

[g.cecchi@ovmsrv07 ~]$ sudo qemu-img check
/rhev/data-center/588237b8-0031-02f6-035d-0136/900b1853-e192-4661-a0f9-7c7c396f6f49/images/f0b5a0e4-ee5d-44a7-ba07-08285791368a/7ed43974-1039-4a68-a8b3-321e7594fe4c
Leaked cluster 4013995 refcount=1 reference=0
Leaked cluster 4013996 refcount=1 reference=0
Leaked cluster 4013997 refcount=1 reference=0

... many lines of this type ...

Leaked cluster 6275183 refcount=1 reference=0
Leaked cluster 6275184 refcount=1 reference=0
Leaked cluster 6275185 refcount=1 reference=0

57506 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
6599964/7045120 = 93.68% allocated, 6.30% fragmented, 0.00% compressed
clusters
Image end offset: 436986380288
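
If the leaked clusters themselves ever need to be reclaimed, qemu-img can
repair them, but only while nothing is using the image (VM powered off and no
storage jobs running on the volume); a hedged example of the command, with the
volume path abbreviated:

sudo qemu-img check -r leaks <path-to-the-volume>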

Would it help in any way to shut down the VM to unlock the disk?

Thanks,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-20 Thread Gianluca Cecchi
On Mon, Feb 20, 2017 at 10:51 PM, Gianluca Cecchi  wrote:

> On Mon, Feb 20, 2017 at 8:46 PM, Fred Rolland  wrote:
>
>> Can you please send the whole logs ? (Engine, vdsm and sanlock)
>>
>>
> vdsm.log.1.xz:
> https://drive.google.com/file/d/0BwoPbcrMv8mvWTViWEUtNjRtLTg/
> view?usp=sharing
>
> sanlock.log
> https://drive.google.com/file/d/0BwoPbcrMv8mvcVM4YzZ4aUZLYVU/
> view?usp=sharing
>
> engine.log (gzip format);
> https://drive.google.com/file/d/0BwoPbcrMv8mvdW80RlFIYkpzenc/
> view?usp=sharing
>
> Thanks,
> Gianluca
>
>
I didn't mention that the size of the disk is 430 GB and the target storage
domain is 1 TB, almost empty (950 GB free).
I received a message about problems from the storage where the disk is,
and so I'm trying to move it so that I can put the original one under
maintenance and investigate.
The errors seem to be about creating the volume on the destination, not the
source...
thanks,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-20 Thread Gianluca Cecchi
On Mon, Feb 20, 2017 at 8:46 PM, Fred Rolland  wrote:

> Can you please send the whole logs ? (Engine, vdsm and sanlock)
>
>
vdsm.log.1.xz:
https://drive.google.com/file/d/0BwoPbcrMv8mvWTViWEUtNjRtLTg/view?usp=sharing

sanlock.log
https://drive.google.com/file/d/0BwoPbcrMv8mvcVM4YzZ4aUZLYVU/view?usp=sharing

engine.log (gzip format);
https://drive.google.com/file/d/0BwoPbcrMv8mvdW80RlFIYkpzenc/view?usp=sharing

Thanks,
Gianluca
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] moving disk failed.. remained locked

2017-02-20 Thread Fred Rolland
Can you please send the whole logs ? (Engine, vdsm and sanlock)

On Mon, Feb 20, 2017 at 4:49 PM, Gianluca Cecchi 
wrote:

> Hello,
> I'm trying to move a disk from one storage domain A to another B in oVirt
> 4.1
> The corresponding VM is powered on in the mean time
>
> When executing the action, there was already in place a disk move from
> storage domain C to A (this move was for a disk of a powered off VM and
> then completed ok)
> I got this in events of webadmin gui for the failed move A -> B:
>
> Feb 20, 2017 2:42:00 PM Failed to complete snapshot 'Auto-generated for
> Live Storage Migration' creation for VM 'dbatest6'.
> Feb 20, 2017 2:40:51 PM VDSM ovmsrv06 command HSMGetAllTasksStatusesVDS
> failed: Error creating a new volume
> Feb 20, 2017 2:40:51 PM Snapshot 'Auto-generated for Live Storage
> Migration' creation for VM 'dbatest6' was initiated by admin@internal-authz.
>
>
> And in relevant vdsm.log of referred host ovmsrv06
>
> 2017-02-20 14:41:44,899 ERROR (tasks/8) [storage.Volume] Unexpected error
> (volume:1087)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/volume.py", line 1081, in create
> cls.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/volume.py", line 1361, in newVolumeLease
> return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
>   File "/usr/share/vdsm/storage/blockVolume.py", line 310, in
> newVolumeLease
> sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
> SanlockException: (-202, 'Sanlock resource init failure', 'Sanlock
> exception')
> 2017-02-20 14:41:44,900 ERROR (tasks/8) [storage.TaskManager.Task]
> (Task='d694b892-b078-4d86-a035-427ee4fb3b13') Unexpected error (task:870)
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/task.py", line 877, in _run
> return fn(*args, **kargs)
>   File "/usr/share/vdsm/storage/task.py", line 333, in run
> return self.cmd(*self.argslist, **self.argsdict)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line
> 79, in wrapper
> return method(self, *args, **kwargs)
>   File "/usr/share/vdsm/storage/sp.py", line 1929, in createVolume
> initialSize=initialSize)
>   File "/usr/share/vdsm/storage/sd.py", line 762, in createVolume
> initialSize=initialSize)
>   File "/usr/share/vdsm/storage/volume.py", line 1089, in create
> (volUUID, e))
> VolumeCreationError: Error creating a new volume: (u"Volume creation
> d0d938bd-1479-49cb-93fb-85b6a32d6cb4 failed: (-202, 'Sanlock resource
> init failure', 'Sanlock exception')",)
> 2017-02-20 14:41:44,941 INFO  (tasks/8) [storage.Volume] Metadata rollback
> for sdUUID=900b1853-e192-4661-a0f9-7c7c396f6f49 offs=8 (blockVolume:448)
>
>
> Was the error generated due to the other migration still in progress?
> Is there a limit of concurrent migrations from/to a particular storage
> domain?
>
> Now I would like to retry, but I see that the disk is in state locked with
> hourglass.
> The autogenerated snapshot of the failed action was apparently removed
> with success as I don't see it.
>
> How can I proceed to move the disk?
>
> Thanks in advance,
> Gianluca
>
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users