Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Dan Kenigsberg
On Wed, Apr 25, 2018 at 7:20 PM, Ravi Shankar Nori  wrote:
>
>
> On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina  wrote:
>>
>>
>>
>> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg  wrote:
>>>
>>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
>>> wrote:
>>> >
>>> >
>>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>>> > wrote:
>>> >>
>>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>>> >> be put back into its place.
>>> >>
>>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> >> requests queued up. At one point, the host resumed its connection,
>>> >> before the requests have been cleared of the queue. After the host is
>>> >> up, the following tests resume, and at a pseudorandom point in time,
>>> >> an old getCapsAsync request times out and kills our connection.
>>> >>
>>> >> I believe that as long as ANY request is on flight, the monitoring
>>> >> lock should not be released, and the host should not be declared as
>>> >> up.
>>> >>
>>> >>
>>> >
>>> >
>>> > Hi Dan,
>>> >
>>> > Can I have the link to the job on jenkins so I can look at the logs
>>>
>>> We disabled a network test that started failing after getCapsAsync was
>>> merged.
>>> Please own its re-introduction to OST:
>>> https://gerrit.ovirt.org/#/c/90264/
>>>
>>> Its most recent failure
>>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>>> has been discussed by Alona and Piotr over IRC.
>>
>>
>> So https://bugzilla.redhat.com/1571768 was created to cover this issue
>> discovered during Alona's and Piotr's conversation. But after further
>> discussion we have found out that this issue is not related to non-blocking
>> thread changes in engine 4.2 and this behavior exists from beginning of
>> vdsm-jsonrpc-java. Ravi will continue verify the fix for BZ1571768 along
>> with other locking changes he already posted to see if they will help
>> network OST to succeed.
>>
>> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix
>> that on master and let's see if it doesn't introduce any regressions. If
>> not, then we can backport to 4.2.4.
>>
>>
>>
>> --
>> Martin Perina
>> Associate Manager, Software Engineering
>> Red Hat Czech s.r.o.
>
>
> Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST
> issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
>
> I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back
> 006_migrations.prepare_migration_attachments_ipv6 [3]  and the jobs
> succeeded thrice [4][5][6]
>
> [1] https://gerrit.ovirt.org/#/c/90646/
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768
> [3] https://gerrit.ovirt.org/#/c/90264/
> [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/
> [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/
> [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/

Eyal, Gal: would you please take the test back?


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Dan Kenigsberg
On Wed, Apr 25, 2018 at 8:15 PM, Piotr Kliczewski  wrote:
>
>
> On Wed, Apr 25, 2018 at 19:07, Martin Perina
> wrote:
>>
>> Ravi/Piotr, so what's the connection between non-blocking threads,
>> jsonrpc-java connection closing and failing this network test? Does it mean
>> that non-blocking threads change just revealed the jsonrpc-java issue which
>> we haven't noticed before?
>> And did the test really works with code prior to non-blocking threads
>> changes and we are missing something else?
>
>
> I think that the test found something not related to non-blocking threads.
> This behavior was in the code since the beginning.

ipv6 migration test is quite old too

commit dea714ba85d5c0b77859f504bb80840275277e68
Author: Ondřej Svoboda 
Date:   Tue Mar 14 11:21:53 2017 +0100

Re-enable IPv6 migration.

and vdsm_recovery test is not so young:

commit 315cc100bc9bc0d81c01728a3576a10bfa576ac4
Author: Milan Zamazal 
Date:   Fri Sep 23 10:08:47 2016 +0200

vdsm_recovery test

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Piotr Kliczewski
On Wed, Apr 25, 2018 at 19:07, Martin Perina
wrote:

> Ravi/Piotr, so what's the connection between non-blocking threads,
> jsonrpc-java connection closing and failing this network test? Does it mean
> that non-blocking threads change just revealed the jsonrpc-java issue which
> we haven't noticed before?
> And did the test really works with code prior to non-blocking threads
> changes and we are missing something else?
>

I think that the test found something not related to non-blocking threads.
This behavior has been in the code since the beginning.


>
> On Wed, 25 Apr 2018, 18:21 Ravi Shankar Nori,  wrote:
>
>>
>>
>> On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina 
>> wrote:
>>
>>>
>>>
>>> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg 
>>> wrote:
>>>
 On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
 wrote:
 >
 >
 > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
 wrote:
 >>
 >> Ravi's patch is in, but a similar problem remains, and the test
 cannot
 >> be put back into its place.
 >>
 >> It seems that while Vdsm was taken down, a couple of getCapsAsync
 >> requests queued up. At one point, the host resumed its connection,
 >> before the requests have been cleared of the queue. After the host is
 >> up, the following tests resume, and at a pseudorandom point in time,
 >> an old getCapsAsync request times out and kills our connection.
 >>
 >> I believe that as long as ANY request is on flight, the monitoring
 >> lock should not be released, and the host should not be declared as
 >> up.
 >>
 >>
 >
 >
 > Hi Dan,
 >
 > Can I have the link to the job on jenkins so I can look at the logs

 We disabled a network test that started failing after getCapsAsync was
 merged.
 Please own its re-introduction to OST:
 https://gerrit.ovirt.org/#/c/90264/

 Its most recent failure

 http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
 has been discussed by Alona and Piotr over IRC.

>>>
>>> ​So https://bugzilla.redhat.com/1571768 was created to cover this
>>> issue​ discovered during Alona's and Piotr's conversation. But after
>>> further discussion we have found out that this issue is not related to
>>> non-blocking thread changes in engine 4.2 and this behavior exists from
>>> beginning of vdsm-jsonrpc-java. Ravi will continue verify the fix for
>>> BZ1571768 along with other locking changes he already posted to see if they
>>> will help network OST to succeed.
>>>
>>> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix
>>> that on master and let's see if it doesn't introduce any regressions. If
>>> not, then we can backport to 4.2.4.
>>>
>>>
>>>
>>> --
>>> Martin Perina
>>> Associate Manager, Software Engineering
>>> Red Hat Czech s.r.o.
>>>
>>
>> Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the
>> OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
>>
>> I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back
>> 006_migrations.prepare_migration_attachments_ipv6 [3]  and the jobs
>> succeeded thrice [4][5][6]
>>
>> [1] https://gerrit.ovirt.org/#/c/90646/
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768
>> [3] https://gerrit.ovirt.org/#/c/90264/
>> [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/
>> [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/
>> [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/
>>
>

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Martin Perina
Ravi/Piotr, so what's the connection between non-blocking threads,
jsonrpc-java closing the connection, and this network test failing? Does it mean
that the non-blocking threads change merely revealed a jsonrpc-java issue which
we hadn't noticed before?
And did the test really work with the code prior to the non-blocking threads
changes, or are we missing something else?


On Wed, 25 Apr 2018, 18:21 Ravi Shankar Nori,  wrote:

>
>
> On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina 
> wrote:
>
>>
>>
>> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg 
>> wrote:
>>
>>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
>>> wrote:
>>> >
>>> >
>>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>>> wrote:
>>> >>
>>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>>> >> be put back into its place.
>>> >>
>>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> >> requests queued up. At one point, the host resumed its connection,
>>> >> before the requests have been cleared of the queue. After the host is
>>> >> up, the following tests resume, and at a pseudorandom point in time,
>>> >> an old getCapsAsync request times out and kills our connection.
>>> >>
>>> >> I believe that as long as ANY request is on flight, the monitoring
>>> >> lock should not be released, and the host should not be declared as
>>> >> up.
>>> >>
>>> >>
>>> >
>>> >
>>> > Hi Dan,
>>> >
>>> > Can I have the link to the job on jenkins so I can look at the logs
>>>
>>> We disabled a network test that started failing after getCapsAsync was
>>> merged.
>>> Please own its re-introduction to OST:
>>> https://gerrit.ovirt.org/#/c/90264/
>>>
>>> Its most recent failure
>>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>>> has been discussed by Alona and Piotr over IRC.
>>>
>>
>> ​So https://bugzilla.redhat.com/1571768 was created to cover this issue​
>> discovered during Alona's and Piotr's conversation. But after further
>> discussion we have found out that this issue is not related to non-blocking
>> thread changes in engine 4.2 and this behavior exists from beginning of
>> vdsm-jsonrpc-java. Ravi will continue verify the fix for BZ1571768 along
>> with other locking changes he already posted to see if they will help
>> network OST to succeed.
>>
>> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix
>> that on master and let's see if it doesn't introduce any regressions. If
>> not, then we can backport to 4.2.4.
>>
>>
>>
>> --
>> Martin Perina
>> Associate Manager, Software Engineering
>> Red Hat Czech s.r.o.
>>
>
> Posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the
> OST issue with enabling 006_migrations.prepare_migration_attachments_ipv6.
>
> I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back
> 006_migrations.prepare_migration_attachments_ipv6 [3]  and the jobs
> succeeded thrice [4][5][6]
>
> [1] https://gerrit.ovirt.org/#/c/90646/
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768
> [3] https://gerrit.ovirt.org/#/c/90264/
> [4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/
> [5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/
> [6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/
>

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Ravi Shankar Nori
On Wed, Apr 25, 2018 at 10:57 AM, Martin Perina  wrote:

>
>
> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg  wrote:
>
>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>> wrote:
>> >>
>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>> >> be put back into its place.
>> >>
>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> >> requests queued up. At one point, the host resumed its connection,
>> >> before the requests have been cleared of the queue. After the host is
>> >> up, the following tests resume, and at a pseudorandom point in time,
>> >> an old getCapsAsync request times out and kills our connection.
>> >>
>> >> I believe that as long as ANY request is on flight, the monitoring
>> >> lock should not be released, and the host should not be declared as
>> >> up.
>> >>
>> >>
>> >
>> >
>> > Hi Dan,
>> >
>> > Can I have the link to the job on jenkins so I can look at the logs
>>
>> We disabled a network test that started failing after getCapsAsync was
>> merged.
>> Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/9
>> 0264/
>>
>> Its most recent failure
>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>> has been discussed by Alona and Piotr over IRC.
>>
>
> ​So https://bugzilla.redhat.com/1571768 was created to cover this issue​
> discovered during Alona's and Piotr's conversation. But after further
> discussion we have found out that this issue is not related to non-blocking
> thread changes in engine 4.2 and this behavior exists from beginning of
> vdsm-jsonrpc-java. Ravi will continue verify the fix for BZ1571768 along
> with other locking changes he already posted to see if they will help
> network OST to succeed.
>
> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix
> that on master and let's see if it doesn't introduce any regressions. If
> not, then we can backport to 4.2.4.
>
>
>
> --
> Martin Perina
> Associate Manager, Software Engineering
> Red Hat Czech s.r.o.
>

I posted a vdsm-jsonrpc-java patch [1] for BZ 1571768 [2] which fixes the OST
issue with enabling 006_migrations.prepare_migration_attachments_ipv6.

I ran OST with the vdsm-jsonrpc-java patch [1] and the patch to add back
006_migrations.prepare_migration_attachments_ipv6 [3], and the jobs
succeeded three times [4][5][6].

[1] https://gerrit.ovirt.org/#/c/90646/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1571768
[3] https://gerrit.ovirt.org/#/c/90264/
[4] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2643/
[5] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2644/
[6] http://jenkins.ovirt.org/job/ovirt-system-tests_manual/2645/

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Dan Kenigsberg
On Wed, Apr 25, 2018 at 5:57 PM, Martin Perina  wrote:
>
>
> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg  wrote:
>>
>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>> > wrote:
>> >>
>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>> >> be put back into its place.
>> >>
>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> >> requests queued up. At one point, the host resumed its connection,
>> >> before the requests have been cleared of the queue. After the host is
>> >> up, the following tests resume, and at a pseudorandom point in time,
>> >> an old getCapsAsync request times out and kills our connection.
>> >>
>> >> I believe that as long as ANY request is on flight, the monitoring
>> >> lock should not be released, and the host should not be declared as
>> >> up.
>> >>
>> >>
>> >
>> >
>> > Hi Dan,
>> >
>> > Can I have the link to the job on jenkins so I can look at the logs
>>
>> We disabled a network test that started failing after getCapsAsync was
>> merged.
>> Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
>>
>> Its most recent failure
>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>> has been discussed by Alona and Piotr over IRC.
>
>
> So https://bugzilla.redhat.com/1571768 was created to cover this issue
> discovered during Alona's and Piotr's conversation. But after further
> discussion we have found out that this issue is not related to non-blocking
> thread changes in engine 4.2 and this behavior exists from beginning of
> vdsm-jsonrpc-java. Ravi will continue verify the fix for BZ1571768 along
> with other locking changes he already posted to see if they will help
> network OST to succeed.
>
> But the fix for BZ1571768 is too dangerous for 4.2.3, let's try to fix that
> on master and let's see if it doesn't introduce any regressions. If not,
> then we can backport to 4.2.4.

I sense that there is a regression in connection management that
coincided with the introduction of async monitoring.
I am not alone: Gal Ben Haim was reluctant to take our test back.

Do you think that it is now safe to take it in
https://gerrit.ovirt.org/#/c/90264/ ?
I'd appreciate your support there. I don't want any test to be skipped
without a very good reason.


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-25 Thread Martin Perina
On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg  wrote:

> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori 
> wrote:
> >
> >
> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
> wrote:
> >>
> >> Ravi's patch is in, but a similar problem remains, and the test cannot
> >> be put back into its place.
> >>
> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
> >> requests queued up. At one point, the host resumed its connection,
> >> before the requests have been cleared of the queue. After the host is
> >> up, the following tests resume, and at a pseudorandom point in time,
> >> an old getCapsAsync request times out and kills our connection.
> >>
> >> I believe that as long as ANY request is on flight, the monitoring
> >> lock should not be released, and the host should not be declared as
> >> up.
> >>
> >>
> >
> >
> > Hi Dan,
> >
> > Can I have the link to the job on jenkins so I can look at the logs
>
> We disabled a network test that started failing after getCapsAsync was
> merged.
> Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
>
> Its most recent failure
> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
> has been discussed by Alona and Piotr over IRC.
>

So https://bugzilla.redhat.com/1571768 was created to cover this issue
discovered during Alona's and Piotr's conversation. But after further
discussion we found out that this issue is not related to the non-blocking
thread changes in engine 4.2; this behavior has existed since the beginning of
vdsm-jsonrpc-java. Ravi will continue verifying the fix for BZ1571768 along
with the other locking changes he has already posted, to see if they help the
network OST suite succeed.

But the fix for BZ1571768 is too dangerous for 4.2.3, so let's fix it on
master first and see whether it introduces any regressions. If not, we can
backport it to 4.2.4.



-- 
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
On Tue, Apr 24, 2018 at 10:27 PM, Ravi Shankar Nori  wrote:
>
>
> On Tue, Apr 24, 2018 at 10:46 AM, Ravi Shankar Nori 
> wrote:
>>
>>
>>
>> On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg 
>> wrote:
>>>
>>> On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori 
>>> wrote:
>>> >
>>> >
>>> > On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg 
>>> > wrote:
>>> >>
>>> >> On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori 
>>> >> wrote:
>>> >> >
>>> >> >
>>> >> > On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina 
>>> >> > wrote:
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori
>>> >> >> 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg
>>> >> >>> 
>>> >> >>> wrote:
>>> >> 
>>> >>  Ravi's patch is in, but a similar problem remains, and the test
>>> >>  cannot
>>> >>  be put back into its place.
>>> >> 
>>> >>  It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> >>  requests queued up. At one point, the host resumed its
>>> >>  connection,
>>> >>  before the requests have been cleared of the queue. After the
>>> >>  host is
>>> >>  up, the following tests resume, and at a pseudorandom point in
>>> >>  time,
>>> >>  an old getCapsAsync request times out and kills our connection.
>>> >> 
>>> >>  I believe that as long as ANY request is on flight, the
>>> >>  monitoring
>>> >>  lock should not be released, and the host should not be declared
>>> >>  as
>>> >>  up.
>>> >>
>>> >> Would you relate to this analysis ^^^ ?
>>> >>
>>> >
>>> > The HostMonitoring lock issue has been fixed by
>>> > https://gerrit.ovirt.org/#/c/90189/
>>>
>>> Is there still a chance that a host moves to Up while former
>>> getCapsAsync request are still in-flight?
>>>
>>
>> Should not happen. Is there a way to execute/reproduce the failing test on
>> Dev env?
>>
>>>
>>> >
>>> >>
>>> >> 
>>> >> 
>>> >> >>>
>>> >> >>>
>>> >> >>> Hi Dan,
>>> >> >>>
>>> >> >>> Can I have the link to the job on jenkins so I can look at the
>>> >> >>> logs
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>>> >> >>
>>> >> >
>>> >> >
>>> >> > From the logs the only VDS lock that is being released twice is
>>> >> > VDS_FENCE
>>> >> > lock. Opened a BZ [1] for it. Will post a fix
>>> >> >
>>> >> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
>>> >>
>>> >> Can this possibly cause a surprise termination of host connection?
>>> >
>>> >
>>> > Not sure, from the logs VDS_FENCE is the only other VDS lock that is
>>> > being
>>> > released
>>
>>
>
> Would be helpful if I can get the exact flow that is failing and also the
> steps if any needed to reproduce the issue

By now the logs of
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
have been garbage-collected, so I cannot point you to the location in
the logs. Maybe Alona has a local copy. According to her analysis, the
issue manifests itself when setupNetworks follows a vdsm restart.

Have you tried running OST with prepare_migration_attachments_ipv6
reintroduced? It should always pass.

Regards,
Dan.


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Ravi Shankar Nori
On Tue, Apr 24, 2018 at 10:46 AM, Ravi Shankar Nori 
wrote:

>
>
> On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg 
> wrote:
>
>> On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori 
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg 
>> wrote:
>> >>
>> >> On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori 
>> >> wrote:
>> >> >
>> >> >
>> >> > On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina 
>> >> > wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <
>> rn...@redhat.com>
>> >> >> wrote:
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg > >
>> >> >>> wrote:
>> >> 
>> >>  Ravi's patch is in, but a similar problem remains, and the test
>> >>  cannot
>> >>  be put back into its place.
>> >> 
>> >>  It seems that while Vdsm was taken down, a couple of getCapsAsync
>> >>  requests queued up. At one point, the host resumed its connection,
>> >>  before the requests have been cleared of the queue. After the
>> host is
>> >>  up, the following tests resume, and at a pseudorandom point in
>> time,
>> >>  an old getCapsAsync request times out and kills our connection.
>> >> 
>> >>  I believe that as long as ANY request is on flight, the monitoring
>> >>  lock should not be released, and the host should not be declared
>> as
>> >>  up.
>> >>
>> >> Would you relate to this analysis ^^^ ?
>> >>
>> >
>> > The HostMonitoring lock issue has been fixed by
>> > https://gerrit.ovirt.org/#/c/90189/
>>
>> Is there still a chance that a host moves to Up while former
>> getCapsAsync request are still in-flight?
>>
>>
> Should not happen. Is there a way to execute/reproduce the failing test on
> Dev env?
>
>
>> >
>> >>
>> >> 
>> >> 
>> >> >>>
>> >> >>>
>> >> >>> Hi Dan,
>> >> >>>
>> >> >>> Can I have the link to the job on jenkins so I can look at the logs
>> >> >>
>> >> >>
>> >> >>
>> >> >> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-che
>> ck-patch/346/
>> >> >>
>> >> >
>> >> >
>> >> > From the logs the only VDS lock that is being released twice is
>> >> > VDS_FENCE
>> >> > lock. Opened a BZ [1] for it. Will post a fix
>> >> >
>> >> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
>> >>
>> >> Can this possibly cause a surprise termination of host connection?
>> >
>> >
>> > Not sure, from the logs VDS_FENCE is the only other VDS lock that is
>> being
>> > released
>>
>
>
It would be helpful if I could get the exact flow that is failing, and also
the steps, if any, needed to reproduce the issue.

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Ravi Shankar Nori
On Tue, Apr 24, 2018 at 10:29 AM, Dan Kenigsberg  wrote:

> On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori 
> wrote:
> >
> >
> > On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg 
> wrote:
> >>
> >> On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori 
> >> wrote:
> >> >
> >> >
> >> > On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina 
> >> > wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori  >
> >> >> wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
> >> >>> wrote:
> >> 
> >>  Ravi's patch is in, but a similar problem remains, and the test
> >>  cannot
> >>  be put back into its place.
> >> 
> >>  It seems that while Vdsm was taken down, a couple of getCapsAsync
> >>  requests queued up. At one point, the host resumed its connection,
> >>  before the requests have been cleared of the queue. After the host
> is
> >>  up, the following tests resume, and at a pseudorandom point in
> time,
> >>  an old getCapsAsync request times out and kills our connection.
> >> 
> >>  I believe that as long as ANY request is on flight, the monitoring
> >>  lock should not be released, and the host should not be declared as
> >>  up.
> >>
> >> Would you relate to this analysis ^^^ ?
> >>
> >
> > The HostMonitoring lock issue has been fixed by
> > https://gerrit.ovirt.org/#/c/90189/
>
> Is there still a chance that a host moves to Up while former
> getCapsAsync request are still in-flight?
>
>
It should not happen. Is there a way to execute/reproduce the failing test in
a dev environment?


> >
> >>
> >> 
> >> 
> >> >>>
> >> >>>
> >> >>> Hi Dan,
> >> >>>
> >> >>> Can I have the link to the job on jenkins so I can look at the logs
> >> >>
> >> >>
> >> >>
> >> >> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-
> check-patch/346/
> >> >>
> >> >
> >> >
> >> > From the logs the only VDS lock that is being released twice is
> >> > VDS_FENCE
> >> > lock. Opened a BZ [1] for it. Will post a fix
> >> >
> >> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
> >>
> >> Can this possibly cause a surprise termination of host connection?
> >
> >
> > Not sure, from the logs VDS_FENCE is the only other VDS lock that is
> being
> > released
>

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
On Tue, Apr 24, 2018 at 5:09 PM, Ravi Shankar Nori  wrote:
>
>
> On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg  wrote:
>>
>> On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori 
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina 
>> > wrote:
>> >>
>> >>
>> >>
>> >> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori 
>> >> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>> >>> wrote:
>> 
>>  Ravi's patch is in, but a similar problem remains, and the test
>>  cannot
>>  be put back into its place.
>> 
>>  It seems that while Vdsm was taken down, a couple of getCapsAsync
>>  requests queued up. At one point, the host resumed its connection,
>>  before the requests have been cleared of the queue. After the host is
>>  up, the following tests resume, and at a pseudorandom point in time,
>>  an old getCapsAsync request times out and kills our connection.
>> 
>>  I believe that as long as ANY request is on flight, the monitoring
>>  lock should not be released, and the host should not be declared as
>>  up.
>>
>> Would you relate to this analysis ^^^ ?
>>
>
> The HostMonitoring lock issue has been fixed by
> https://gerrit.ovirt.org/#/c/90189/

Is there still a chance that a host moves to Up while former
getCapsAsync requests are still in flight?

>
>>
>> 
>> 
>> >>>
>> >>>
>> >>> Hi Dan,
>> >>>
>> >>> Can I have the link to the job on jenkins so I can look at the logs
>> >>
>> >>
>> >>
>> >> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>> >>
>> >
>> >
>> > From the logs the only VDS lock that is being released twice is
>> > VDS_FENCE
>> > lock. Opened a BZ [1] for it. Will post a fix
>> >
>> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
>>
>> Can this possibly cause a surprise termination of host connection?
>
>
> Not sure, from the logs VDS_FENCE is the only other VDS lock that is being
> released


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Ravi Shankar Nori
On Tue, Apr 24, 2018 at 9:47 AM, Dan Kenigsberg  wrote:

> On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori 
> wrote:
> >
> >
> > On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina 
> wrote:
> >>
> >>
> >>
> >> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori 
> >> wrote:
> >>>
> >>>
> >>>
> >>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
> >>> wrote:
> 
>  Ravi's patch is in, but a similar problem remains, and the test cannot
>  be put back into its place.
> 
>  It seems that while Vdsm was taken down, a couple of getCapsAsync
>  requests queued up. At one point, the host resumed its connection,
>  before the requests have been cleared of the queue. After the host is
>  up, the following tests resume, and at a pseudorandom point in time,
>  an old getCapsAsync request times out and kills our connection.
> 
>  I believe that as long as ANY request is on flight, the monitoring
>  lock should not be released, and the host should not be declared as
>  up.
>
> Would you relate to this analysis ^^^ ?
>
>
The HostMonitoring lock issue has been fixed by
https://gerrit.ovirt.org/#/c/90189/


> 
> 
> >>>
> >>>
> >>> Hi Dan,
> >>>
> >>> Can I have the link to the job on jenkins so I can look at the logs
> >>
> >>
> >> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-
> check-patch/346/
> >>
> >
> >
> > From the logs the only VDS lock that is being released twice is VDS_FENCE
> > lock. Opened a BZ [1] for it. Will post a fix
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
>
> Can this possibly cause a surprise termination of host connection?
>

Not sure; from the logs, VDS_FENCE is the only other VDS lock that is being
released.

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
On Tue, Apr 24, 2018 at 4:36 PM, Ravi Shankar Nori  wrote:
>
>
> On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina  wrote:
>>
>>
>>
>> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori 
>> wrote:
>>>
>>>
>>>
>>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>>> wrote:

 Ravi's patch is in, but a similar problem remains, and the test cannot
 be put back into its place.

 It seems that while Vdsm was taken down, a couple of getCapsAsync
 requests queued up. At one point, the host resumed its connection,
 before the requests have been cleared of the queue. After the host is
 up, the following tests resume, and at a pseudorandom point in time,
 an old getCapsAsync request times out and kills our connection.

 I believe that as long as ANY request is on flight, the monitoring
 lock should not be released, and the host should not be declared as
 up.

Would you relate to this analysis ^^^ ?



>>>
>>>
>>> Hi Dan,
>>>
>>> Can I have the link to the job on jenkins so I can look at the logs
>>
>>
>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>>
>
>
> From the logs the only VDS lock that is being released twice is VDS_FENCE
> lock. Opened a BZ [1] for it. Will post a fix
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300

Can this possibly cause a surprise termination of the host connection?


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Ravi Shankar Nori
On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina  wrote:

>
>
> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori 
> wrote:
>
>>
>>
>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg 
>> wrote:
>>
>>> Ravi's patch is in, but a similar problem remains, and the test cannot
>>> be put back into its place.
>>>
>>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> requests queued up. At one point, the host resumed its connection,
>>> before the requests have been cleared of the queue. After the host is
>>> up, the following tests resume, and at a pseudorandom point in time,
>>> an old getCapsAsync request times out and kills our connection.
>>>
>>> I believe that as long as ANY request is on flight, the monitoring
>>> lock should not be released, and the host should not be declared as
>>> up.
>>>
>>>
>>>
>>
>> Hi Dan,
>>
>> Can I have the link to the job on jenkins so I can look at the logs
>>
>
> ​http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
> ​
>
>

From the logs, the only VDS lock that is being released twice is the VDS_FENCE
lock. I opened a BZ [1] for it and will post a fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300


>
>>
>>
>>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori 
>>> wrote:
>>> > This [1] should fix the multiple release lock issue
>>> >
>>> > [1] https://gerrit.ovirt.org/#/c/90077/
>>> >
>>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori 
>>> wrote:
>>> >>
>>> >> Working on a patch will post a fix
>>> >>
>>> >> Thanks
>>> >>
>>> >> Ravi
>>> >>
>>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan 
>>> wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> >>> responsible for the mess.
>>> >>>
>>> >>> - 08:29:47 - engine loses connectivity to host
>>> >>> 'lago-basic-suite-4-2-host-0'.
>>> >>>
>>> >>> - Every 3 seconds a getCapabalititiesAsync request is sent to the
>>> host
>>> >>> (unsuccessfully).
>>> >>>
>>> >>>  * before each "getCapabilitiesAsync" the monitoring lock is
>>> taken
>>> >>> (VdsManager,refreshImpl)
>>> >>>
>>> >>>  * "getCapabilitiesAsync" immediately fails and throws
>>> >>> 'VDSNetworkException: java.net.ConnectException: Connection
>>> refused'. The
>>> >>> exception is caught by
>>> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
>>> >>> 'onFailure' of the callback and re-throws the exception.
>>> >>>
>>> >>>  catch (Throwable t) {
>>> >>> getParameters().getCallback().onFailure(t);
>>> >>> throw t;
>>> >>>  }
>>> >>>
>>> >>> * The 'onFailure' of the callback releases the "monitoringLock"
>>> >>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
>>> >>> lockManager.releaseLock(monitoringLock);')
>>> >>>
>>> >>> * 'VdsManager,refreshImpl' catches the network exception, marks
>>> >>> 'releaseLock = true' and tries to release the already released lock.
>>> >>>
>>> >>>   The following warning is printed to the log -
>>> >>>
>>> >>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
>>> release
>>> >>> exclusive lock which does not exist, lock key:
>>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>>
>>> >>> - 08:30:51 a successful getCapabilitiesAsync is sent.
>>> >>>
>>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting
>>> ipv6).
>>> >>>
>>> >>>
>>> >>> * SetupNetworks takes the monitoring lock.
>>> >>>
>>> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>>> >>> from 4 minutes ago from its queue and prints a VDSNetworkException:
>>> Vds
>>> >>> timeout occured.
>>> >>>
>>> >>>   * When the first request is removed from the queue
>>> >>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked
>>> (for the
>>> >>> second time) -> monitoring lock is released (the lock taken by the
>>> >>> SetupNetworks!).
>>> >>>
>>> >>>   * The other requests removed from the queue also try to
>>> release the
>>> >>> monitoring lock, but there is nothing to release.
>>> >>>
>>> >>>   * The following warning log is printed -
>>> >>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
>>> release
>>> >>> exclusive lock which does not exist, lock key:
>>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is
>>> started.
>>> >>> Why? I'm not 100% sure but I guess the late processing of the
>>> >>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and
>>> the
>>> >>> late + mupltiple processing of failure is root cause.
>>> >>>
>>> >>>
>>> >>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is
>>> >>> trying to be released three times. Please share your opinion
>>> regarding how
>>> >>> it should be fixed.
>
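
A minimal sketch of a single-shot release guard that follows from the quoted
analysis above. It assumes a hypothetical MonitoringLockReleaseToken helper;
the class name, and the idea of wiring it into both the callback's onFailure
and refreshImpl's cleanup, are assumptions rather than the engine's actual
code. If every path that may release the monitoring lock of one refresh cycle
goes through a single token, a late onFailure fired again by the response
tracker's timeout sweep becomes a no-op instead of releasing a lock that
SetupNetworks has since acquired.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical sketch, not ovirt-engine code: one token per refresh cycle
    // makes the monitoring-lock release idempotent.
    class MonitoringLockReleaseToken {
        private final AtomicBoolean released = new AtomicBoolean(false);
        private final Runnable releaseAction; // e.g. the lock-manager release call

        MonitoringLockReleaseToken(Runnable releaseAction) {
            this.releaseAction = releaseAction;
        }

        // Releases the underlying lock at most once; later calls are no-ops,
        // so a duplicate failure callback cannot release a lock that now
        // belongs to another flow.
        void releaseOnce() {
            if (released.compareAndSet(false, true)) {
                releaseAction.run();
            }
        }
    }

With such a token, the "Trying to release exclusive lock which does not exist"
warning would disappear, and, more importantly, a stale request could no
longer release the lock taken later by SetupNetworks.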

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori  wrote:
>
>
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg  wrote:
>>
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> requests queued up. At one point, the host resumed its connection,
>> before the requests have been cleared of the queue. After the host is
>> up, the following tests resume, and at a pseudorandom point in time,
>> an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is on flight, the monitoring
>> lock should not be released, and the host should not be declared as
>> up.
>>
>>
>
>
> Hi Dan,
>
> Can I have the link to the job on jenkins so I can look at the logs

We disabled a network test that started failing after getCapsAsync was merged.
Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/

Its most recent failure
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
has been discussed by Alona and Piotr over IRC.


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Martin Perina
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori  wrote:

>
>
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg  wrote:
>
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> requests queued up. At one point, the host resumed its connection,
>> before the requests have been cleared of the queue. After the host is
>> up, the following tests resume, and at a pseudorandom point in time,
>> an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is on flight, the monitoring
>> lock should not be released, and the host should not be declared as
>> up.
>>
>>
>>
>
> Hi Dan,
>
> Can I have the link to the job on jenkins so I can look at the logs
>

http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/


>
>
>
>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori 
>> wrote:
>> > This [1] should fix the multiple release lock issue
>> >
>> > [1] https://gerrit.ovirt.org/#/c/90077/
>> >
>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori 
>> wrote:
>> >>
>> >> Working on a patch will post a fix
>> >>
>> >> Thanks
>> >>
>> >> Ravi
>> >>
>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan 
>> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> >>> responsible for the mess.
>> >>>
>> >>> - 08:29:47 - engine loses connectivity to host
>> >>> 'lago-basic-suite-4-2-host-0'.
>> >>>
>> >>> - Every 3 seconds a getCapabalititiesAsync request is sent to the host
>> >>> (unsuccessfully).
>> >>>
>> >>>  * before each "getCapabilitiesAsync" the monitoring lock is taken
>> >>> (VdsManager,refreshImpl)
>> >>>
>> >>>  * "getCapabilitiesAsync" immediately fails and throws
>> >>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
>> The
>> >>> exception is caught by
>> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
>> >>> 'onFailure' of the callback and re-throws the exception.
>> >>>
>> >>>  catch (Throwable t) {
>> >>> getParameters().getCallback().onFailure(t);
>> >>> throw t;
>> >>>  }
>> >>>
>> >>> * The 'onFailure' of the callback releases the "monitoringLock"
>> >>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
>> >>> lockManager.releaseLock(monitoringLock);')
>> >>>
>> >>> * 'VdsManager,refreshImpl' catches the network exception, marks
>> >>> 'releaseLock = true' and tries to release the already released lock.
>> >>>
>> >>>   The following warning is printed to the log -
>> >>>
>> >>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
>> release
>> >>> exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>>
>> >>> - 08:30:51 a successful getCapabilitiesAsync is sent.
>> >>>
>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting
>> ipv6).
>> >>>
>> >>>
>> >>> * SetupNetworks takes the monitoring lock.
>> >>>
>> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>> >>> from 4 minutes ago from its queue and prints a VDSNetworkException:
>> Vds
>> >>> timeout occured.
>> >>>
>> >>>   * When the first request is removed from the queue
>> >>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked
>> (for the
>> >>> second time) -> monitoring lock is released (the lock taken by the
>> >>> SetupNetworks!).
>> >>>
>> >>>   * The other requests removed from the queue also try to release
>> the
>> >>> monitoring lock, but there is nothing to release.
>> >>>
>> >>>   * The following warning log is printed -
>> >>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
>> release
>> >>> exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is
>> started.
>> >>> Why? I'm not 100% sure but I guess the late processing of the
>> >>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and
>> the
>> >>> late + mupltiple processing of failure is root cause.
>> >>>
>> >>>
>> >>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is
>> >>> trying to be released three times. Please share your opinion
>> regarding how
>> >>> it should be fixed.
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Alona.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg 
>> wrote:
>> 
>>  On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas 
>> wrote:
>> >
>> >
>> >
>> > On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>> >>
>> >> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Ravi Shankar Nori
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg  wrote:

> Ravi's patch is in, but a similar problem remains, and the test cannot
> be put back into its place.
>
> It seems that while Vdsm was taken down, a couple of getCapsAsync
> requests queued up. At one point, the host resumed its connection,
> before the requests have been cleared of the queue. After the host is
> up, the following tests resume, and at a pseudorandom point in time,
> an old getCapsAsync request times out and kills our connection.
>
> I believe that as long as ANY request is on flight, the monitoring
> lock should not be released, and the host should not be declared as
> up.
>
>
>

Hi Dan,

Can I have the link to the job on Jenkins so I can look at the logs?



> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori 
> wrote:
> > This [1] should fix the multiple release lock issue
> >
> > [1] https://gerrit.ovirt.org/#/c/90077/
> >
> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori 
> wrote:
> >>
> >> Working on a patch will post a fix
> >>
> >> Thanks
> >>
> >> Ravi
> >>
> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
> >>> responsible for the mess.
> >>>
> >>> - 08:29:47 - engine loses connectivity to host
> >>> 'lago-basic-suite-4-2-host-0'.
> >>>
> >>> - Every 3 seconds a getCapabalititiesAsync request is sent to the host
> >>> (unsuccessfully).
> >>>
> >>>  * before each "getCapabilitiesAsync" the monitoring lock is taken
> >>> (VdsManager,refreshImpl)
> >>>
> >>>  * "getCapabilitiesAsync" immediately fails and throws
> >>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
> The
> >>> exception is caught by
> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
> >>> 'onFailure' of the callback and re-throws the exception.
> >>>
> >>>  catch (Throwable t) {
> >>> getParameters().getCallback().onFailure(t);
> >>> throw t;
> >>>  }
> >>>
> >>> * The 'onFailure' of the callback releases the "monitoringLock"
> >>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
> >>> lockManager.releaseLock(monitoringLock);')
> >>>
> >>> * 'VdsManager,refreshImpl' catches the network exception, marks
> >>> 'releaseLock = true' and tries to release the already released lock.
> >>>
> >>>   The following warning is printed to the log -
> >>>
> >>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
> release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>>
> >>> - 08:30:51 a successful getCapabilitiesAsync is sent.
> >>>
> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
> >>>
> >>>
> >>> * SetupNetworks takes the monitoring lock.
> >>>
> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
> >>> from 4 minutes ago from its queue and prints a VDSNetworkException: Vds
> >>> timeout occured.
> >>>
> >>>   * When the first request is removed from the queue
> >>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for
> the
> >>> second time) -> monitoring lock is released (the lock taken by the
> >>> SetupNetworks!).
> >>>
> >>>   * The other requests removed from the queue also try to release
> the
> >>> monitoring lock, but there is nothing to release.
> >>>
> >>>   * The following warning log is printed -
> >>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
> release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started.
> >>> Why? I'm not 100% sure but I guess the late processing of the
> >>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and
> the
> >>> late + mupltiple processing of failure is root cause.
> >>>
> >>>
> >>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is
> >>> trying to be released three times. Please share your opinion regarding
> how
> >>> it should be fixed.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Alona.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg 
> wrote:
> 
>  On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
> >
> >
> >
> > On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
> >>
> >> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >> Is it still failing?
> >>
> >> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
> >> wrote:
> >>>
> >>> On 7 April 2018 at 00:30, Dan Kenigsberg 
> wrote:
> >>> > No, I am afraid that we have not managed to understand why
> setting
> >>> > and
> >>> > ipv6 addre

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
On Tue, Apr 24, 2018 at 3:14 PM, Yaniv Kaul  wrote:
>
>
> On Tue, Apr 24, 2018 at 2:00 PM, Dan Kenigsberg  wrote:
>>
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>
>
> - Why is VDSM down?

The vdsm_recovery test, I presume.


Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Yaniv Kaul
On Tue, Apr 24, 2018 at 2:00 PM, Dan Kenigsberg  wrote:

> Ravi's patch is in, but a similar problem remains, and the test cannot
> be put back into its place.
>
> It seems that while Vdsm was taken down, a couple of getCapsAsync
>

- Why is VDSM down?
Y.


> requests queued up. At one point, the host resumed its connection,
> before the requests have been cleared of the queue. After the host is
> up, the following tests resume, and at a pseudorandom point in time,
> an old getCapsAsync request times out and kills our connection.
>
> I believe that as long as ANY request is on flight, the monitoring
> lock should not be released, and the host should not be declared as
> up.
>
>
> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori 
> wrote:
> > This [1] should fix the multiple release lock issue
> >
> > [1] https://gerrit.ovirt.org/#/c/90077/
> >
> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori 
> wrote:
> >>
> >> Working on a patch will post a fix
> >>
> >> Thanks
> >>
> >> Ravi
> >>
> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
> >>> responsible for the mess.
> >>>
> >>> - 08:29:47 - engine loses connectivity to host
> >>> 'lago-basic-suite-4-2-host-0'.
> >>>
> >>> - Every 3 seconds a getCapabalititiesAsync request is sent to the host
> >>> (unsuccessfully).
> >>>
> >>>  * before each "getCapabilitiesAsync" the monitoring lock is taken
> >>> (VdsManager,refreshImpl)
> >>>
> >>>  * "getCapabilitiesAsync" immediately fails and throws
> >>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
> The
> >>> exception is caught by
> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
> >>> 'onFailure' of the callback and re-throws the exception.
> >>>
> >>>  catch (Throwable t) {
> >>> getParameters().getCallback().onFailure(t);
> >>> throw t;
> >>>  }
> >>>
> >>> * The 'onFailure' of the callback releases the "monitoringLock"
> >>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
> >>> lockManager.releaseLock(monitoringLock);')
> >>>
> >>> * 'VdsManager,refreshImpl' catches the network exception, marks
> >>> 'releaseLock = true' and tries to release the already released lock.
> >>>
> >>>   The following warning is printed to the log -
> >>>
> >>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
> release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>>
> >>> - 08:30:51 a successful getCapabilitiesAsync is sent.
> >>>
> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
> >>>
> >>>
> >>> * SetupNetworks takes the monitoring lock.
> >>>
> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
> >>> from 4 minutes ago from its queue and prints a VDSNetworkException: Vds
> >>> timeout occured.
> >>>
> >>>   * When the first request is removed from the queue
> >>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for
> the
> >>> second time) -> monitoring lock is released (the lock taken by the
> >>> SetupNetworks!).
> >>>
> >>>   * The other requests removed from the queue also try to release
> the
> >>> monitoring lock, but there is nothing to release.
> >>>
> >>>   * The following warning log is printed -
> >>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
> release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started.
> >>> Why? I'm not 100% sure but I guess the late processing of the
> >>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and
> the
> >>> late + mupltiple processing of failure is root cause.
> >>>
> >>>
> >>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is
> >>> trying to be released three times. Please share your opinion regarding
> how
> >>> it should be fixed.
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Alona.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg 
> wrote:
> 
>  On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
> >
> >
> >
> > On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
> >>
> >> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >> Is it still failing?
> >>
> >> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
> >> wrote:
> >>>
> >>> On 7 April 2018 at 00:30, Dan Kenigsberg 
> wrote:
> >>> > No, I am afraid that we have not managed to understand why
> setting
> >>> > and
> >>> > ipv6 address too the host off the grid. We shall continue
> >

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-24 Thread Dan Kenigsberg
Ravi's patch is in, but a similar problem remains, and the test cannot
be put back in its place.

It seems that while Vdsm was taken down, a couple of getCapsAsync
requests queued up. At one point, the host resumed its connection
before the requests had been cleared from the queue. After the host is
up, the following tests resume, and at a pseudorandom point in time,
an old getCapsAsync request times out and kills our connection.

I believe that as long as ANY request is in flight, the monitoring
lock should not be released, and the host should not be declared as
up.
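
A minimal sketch of that invariant, assuming a hypothetical in-flight counter
in the host monitor; the class and method names below are illustrative and are
not the engine's actual VdsManager code. The monitoring lock is released, and
the host is moved to Up, only when the last outstanding capabilities request
has completed or timed out.

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch, not ovirt-engine code.
    class HostMonitorSketch {
        private final AtomicInteger inFlightCaps = new AtomicInteger();
        private volatile boolean hostUp;

        // Called just before a getCapsAsync-style request is submitted.
        void onRequestSubmitted() {
            inFlightCaps.incrementAndGet();
        }

        // Called exactly once per request: on response, on failure or on timeout.
        void onRequestCompleted(boolean succeeded) {
            if (inFlightCaps.decrementAndGet() > 0) {
                return; // older requests still pending: keep the lock, stay not-Up
            }
            hostUp = succeeded;      // declare Up only once the queue is drained
            releaseMonitoringLock(); // and release the per-host lock exactly once
        }

        private void releaseMonitoringLock() {
            // stand-in for the engine's lock-manager release call
        }
    }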


On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori  wrote:
> This [1] should fix the multiple release lock issue
>
> [1] https://gerrit.ovirt.org/#/c/90077/
>
> On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori  wrote:
>>
>> Working on a patch will post a fix
>>
>> Thanks
>>
>> Ravi
>>
>> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan  wrote:
>>>
>>> Hi all,
>>>
>>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> responsible for the mess.
>>>
>>> - 08:29:47 - engine loses connectivity to host
>>> 'lago-basic-suite-4-2-host-0'.
>>>
>>> - Every 3 seconds a getCapabalititiesAsync request is sent to the host
>>> (unsuccessfully).
>>>
>>>  * before each "getCapabilitiesAsync" the monitoring lock is taken
>>> (VdsManager,refreshImpl)
>>>
>>>  * "getCapabilitiesAsync" immediately fails and throws
>>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The
>>> exception is caught by
>>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
>>> 'onFailure' of the callback and re-throws the exception.
>>>
>>>  catch (Throwable t) {
>>> getParameters().getCallback().onFailure(t);
>>> throw t;
>>>  }
>>>
>>> * The 'onFailure' of the callback releases the "monitoringLock"
>>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
>>> lockManager.releaseLock(monitoringLock);')
>>>
>>> * 'VdsManager,refreshImpl' catches the network exception, marks
>>> 'releaseLock = true' and tries to release the already released lock.
>>>
>>>   The following warning is printed to the log -
>>>
>>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
>>> exclusive lock which does not exist, lock key:
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>>
>>> - 08:30:51 a successful getCapabilitiesAsync is sent.
>>>
>>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>>
>>>
>>> * SetupNetworks takes the monitoring lock.
>>>
>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>>> from 4 minutes ago from its queue and prints a VDSNetworkException: Vds
>>> timeout occured.
>>>
>>>   * When the first request is removed from the queue
>>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the
>>> second time) -> monitoring lock is released (the lock taken by the
>>> SetupNetworks!).
>>>
>>>   * The other requests removed from the queue also try to release the
>>> monitoring lock, but there is nothing to release.
>>>
>>>   * The following warning log is printed -
>>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
>>> exclusive lock which does not exist, lock key:
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started.
>>> Why? I'm not 100% sure but I guess the late processing of the
>>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the
>>> late + mupltiple processing of failure is root cause.
>>>
>>>
>>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is
>>> trying to be released three times. Please share your opinion regarding how
>>> it should be fixed.
>>>
>>>
>>> Thanks,
>>>
>>> Alona.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:

 On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>
>
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>>
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
>> wrote:
>>>
>>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>>> > No, I am afraid that we have not managed to understand why setting
>>> > and
>>> > ipv6 address too the host off the grid. We shall continue
>>> > researching
>>> > this next week.
>>> >
>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>> > but
>>> > could it possibly be related (I really doubt that)?
>>> >
>
>
> Sorry, but I do not see how this problem is related to VDSM.
> There is nothing that indicates that there is a VDSM problem.
>
> 

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-12 Thread Michal Skrivanek


> On 12 Apr 2018, at 17:15, Dafna Ron  wrote:
> 
> hi,
> 
> we are failing randomly on test 006_migrations.migrate_vm with what seems to 
> be the same issue. 
> 
> the vm seems to be migrated successfully but engine thinks that it failed and 
> re-calls migration getting a response of vm already exists. 
> 
> I don't think this is an issue with the test but rather a regression so I 
> opened a bug: 

I do not think so. I’ve heard someone removed a test in between migrating A->B
and B->A?
If that’s the case, that is the real issue.
You can’t migrate back to A without waiting for A to be cleared out properly.
https://gerrit.ovirt.org/#/c/90166/ should fix it.
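
Roughly, the ordering the test flow needs is the following (a sketch with
made-up helper names, not the actual OST code, which is Python):

    import java.util.function.BooleanSupplier;

    // Hypothetical sketch: after migrating A->B, wait until the VM is no
    // longer defined on the source host before triggering B->A.
    class MigrateBack {
        static void waitThenMigrateBack(BooleanSupplier vmStillDefinedOnSource,
                                        Runnable migrateBackToA) throws InterruptedException {
            long deadline = System.currentTimeMillis() + 120_000; // arbitrary 2-minute timeout
            while (vmStillDefinedOnSource.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new IllegalStateException("source host was never cleared out");
                }
                Thread.sleep(1_000);
            }
            migrateBackToA.run();
        }
    }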

> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1566594 
> 
> 
> Thanks, 
> Dafna
> 
> 
> On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal  > wrote:
> Arik Hadas mailto:aha...@redhat.com>> writes:
> 
> > On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan  > > wrote:
> >
> >>
> >>
> >> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim  >> > wrote:
> >>
> >>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
> >>>
> >>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/ 
> >>> 
> >>>
> >>
> >> Seems like another bug. The migration failed since for some reason the vm
> >> is already defined on the destination host.
> >>
> >> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
> >> error=Virtual machine already exists (api:129)
> >> Traceback (most recent call last):
> >> File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
> >> method
> >> ret = func(*args, **kwargs)
> >> File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
> >> raise exception.VMExists()
> >> VMExists: Virtual machine already exists
> >>
> >>
> > Milan, Francesco, could it be that because of [1] that appears on the
> > destination host right after shutting down the VM, it remained defined on
> > that host?
> 
> I can't see any destroy call in the logs after the successful preceding
> migration from the given host.  That would explain “VMExists” error.
> 
> > [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> > VM callback (clientIF:683)
> >
> > Traceback (most recent call last):
> >
> >   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in
> > dispatchLibvirtEvents
> >
> > v.onLibvirtLifecycleEvent(event, detail, None)
> >
> > AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
> 
> That means that a life cycle event on an unknown VM has arrived, in this
> case apparently destroy event, following the destroy call after the
> failed incoming migration.  The reported AttributeError is a minor bug,
> already fixed in master.  So it's most likely unrelated to the discussed
> problem.
> 
> >>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan  >>> >
> >>> wrote:
> >>>
>  Hi all,
> 
>  Looking at the log it seems that the new GetCapabilitiesAsync is
>  responsible for the mess.
> 
>  -
>  * 08:29:47 - engine loses connectivity to host 
>  'lago-basic-suite-4-2-host-0'.*
> 
> 
> 
>  *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>  (unsuccessfully).*
> 
>   * before each "getCapabilitiesAsync" the monitoring lock is taken 
>  (VdsManager,refreshImpl)
> 
>   * "getCapabilitiesAsync" immediately fails and throws 
>  'VDSNetworkException: java.net.ConnectException: Connection refused'. 
>  The exception is caught by 
>  'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>  'onFailure' of the callback and re-throws the exception.
> 
>   catch (Throwable t) {
>  getParameters().getCallback().onFailure(t);
>  throw t;
>   }
> 
>  * The 'onFailure' of the callback releases the "monitoringLock" 
>  ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>  lockManager.releaseLock(monitoringLock);')
> 
>  * 'VdsManager,refreshImpl' catches the network exception, marks 
>  'releaseLock = true' and *tries to release the already released lock*.
> 
>    The following warning is printed to the log -
> 
>    WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>  (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>  exclusive lock which does not exist, lock key: 
>  'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> 
> 
> 
> 
>  *- 08:30:51 a successful getCapabilitiesAsync is sent.*
> 
> 
>  *- 08:32:55 - The failing test starts (Setup Networks for setti

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-12 Thread Dafna Ron
hi,

we are failing randomly on test 006_migrations.migrate_vm with what seems
to be the same issue.

the VM seems to be migrated successfully, but the engine thinks that it failed
and re-runs the migration, getting a "VM already exists" response.

I don't think this is an issue with the test but rather a regression so I
opened a bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1566594

Thanks,
Dafna


On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal  wrote:

> Arik Hadas  writes:
>
> > On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan 
> wrote:
> >
> >>
> >>
> >> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim 
> wrote:
> >>
> >>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
> >>>
> >>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
> >>>
> >>
> >> Seems like another bug. The migration failed since for some reason the
> vm
> >> is already defined on the destination host.
> >>
> >> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
> >> error=Virtual machine already exists (api:129)
> >> Traceback (most recent call last):
> >> File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122,
> in
> >> method
> >> ret = func(*args, **kwargs)
> >> File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in
> create
> >> raise exception.VMExists()
> >> VMExists: Virtual machine already exists
> >>
> >>
> > Milan, Francesco, could it be that because of [1] that appears on the
> > destination host right after shutting down the VM, it remained defined on
> > that host?
>
> I can't see any destroy call in the logs after the successful preceding
> migration from the given host.  That would explain “VMExists” error.
>
> > [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error
> running
> > VM callback (clientIF:683)
> >
> > Traceback (most recent call last):
> >
> >   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in
> > dispatchLibvirtEvents
> >
> > v.onLibvirtLifecycleEvent(event, detail, None)
> >
> > AttributeError: 'NoneType' object has no attribute
> 'onLibvirtLifecycleEvent'
>
> That means that a life cycle event on an unknown VM has arrived, in this
> case apparently destroy event, following the destroy call after the
> failed incoming migration.  The reported AttributeError is a minor bug,
> already fixed in master.  So it's most likely unrelated to the discussed
> problem.
>
> >>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan 
> >>> wrote:
> >>>
>  Hi all,
> 
>  Looking at the log it seems that the new GetCapabilitiesAsync is
>  responsible for the mess.
> 
>  -
>  * 08:29:47 - engine loses connectivity to host
> 'lago-basic-suite-4-2-host-0'.*
> 
> 
> 
>  *- Every 3 seconds a getCapabalititiesAsync request is sent to the
> host (unsuccessfully).*
> 
>   * before each "getCapabilitiesAsync" the monitoring lock is
> taken (VdsManager,refreshImpl)
> 
>   * "getCapabilitiesAsync" immediately fails and throws
> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The
> exception is caught by 
> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand'
> which calls 'onFailure' of the callback and re-throws the exception.
> 
>   catch (Throwable t) {
>  getParameters().getCallback().onFailure(t);
>  throw t;
>   }
> 
>  * The 'onFailure' of the callback releases the "monitoringLock"
> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
> lockManager.releaseLock(monitoringLock);')
> 
>  * 'VdsManager,refreshImpl' catches the network exception, marks
> 'releaseLock = true' and *tries to release the already released lock*.
> 
>    The following warning is printed to the log -
> 
>    WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
> exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-
> c4aa4e19bd93VDS_INIT'
> 
> 
> 
> 
>  *- 08:30:51 a successful getCapabilitiesAsync is sent.*
> 
> 
>  *- 08:32:55 - The failing test starts (Setup Networks for setting
> ipv6).*
> 
>  * SetupNetworks takes the monitoring lock.
> 
>  *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync
> requests from 4 minutes ago from its queue and prints a
> VDSNetworkException: Vds timeout occured.*
> 
>    * When the first request is removed from the queue
> ('ResponseTracker.remove()'), the
>  *'Callback.onFailure' is invoked (for the second time) -> monitoring
> lock is released (the lock taken by the SetupNetworks!).*
> 
>    * *The other requests removed from the queue also try to
> release the monitoring lock*, but there is nothing to release.
> 
>    * The following warning log is printed -
>  WARN  [org.ovirt.engine.core.b

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-11 Thread Milan Zamazal
Arik Hadas  writes:

> On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan  wrote:
>
>>
>>
>> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim  wrote:
>>
>>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>>
>>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>>
>>
>> Seems like another bug. The migration failed since for some reason the vm
>> is already defined on the destination host.
>>
>> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
>> error=Virtual machine already exists (api:129)
>> Traceback (most recent call last):
>> File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
>> method
>> ret = func(*args, **kwargs)
>> File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>> raise exception.VMExists()
>> VMExists: Virtual machine already exists
>>
>>
> Milan, Francesco, could it be that because of [1] that appears on the
> destination host right after shutting down the VM, it remained defined on
> that host?

I can't see any destroy call in the logs after the successful preceding
migration from the given host.  That would explain the “VMExists” error.

> [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> VM callback (clientIF:683)
>
> Traceback (most recent call last):
>
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in
> dispatchLibvirtEvents
>
> v.onLibvirtLifecycleEvent(event, detail, None)
>
> AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

That means that a life cycle event for an unknown VM has arrived, in this
case apparently a destroy event, following the destroy call after the
failed incoming migration.  The reported AttributeError is a minor bug,
already fixed in master.  So it's most likely unrelated to the discussed
problem.

>>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan 
>>> wrote:
>>>
 Hi all,

 Looking at the log it seems that the new GetCapabilitiesAsync is
 responsible for the mess.

 -
 * 08:29:47 - engine loses connectivity to host 
 'lago-basic-suite-4-2-host-0'.*



 *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
 (unsuccessfully).*

  * before each "getCapabilitiesAsync" the monitoring lock is taken 
 (VdsManager,refreshImpl)

  * "getCapabilitiesAsync" immediately fails and throws 
 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
 exception is caught by 
 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
 'onFailure' of the callback and re-throws the exception.

  catch (Throwable t) {
 getParameters().getCallback().onFailure(t);
 throw t;
  }

 * The 'onFailure' of the callback releases the "monitoringLock" 
 ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
 lockManager.releaseLock(monitoringLock);')

 * 'VdsManager,refreshImpl' catches the network exception, marks 
 'releaseLock = true' and *tries to release the already released lock*.

   The following warning is printed to the log -

   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
 (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
 exclusive lock which does not exist, lock key: 
 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'




 *- 08:30:51 a successful getCapabilitiesAsync is sent.*


 *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).   
  *

 * SetupNetworks takes the monitoring lock.

 *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests 
 from 4 minutes ago from its queue and prints a VDSNetworkException: Vds 
 timeout occured.*

   * When the first request is removed from the queue 
 ('ResponseTracker.remove()'), the
 *'Callback.onFailure' is invoked (for the second time) -> monitoring lock 
 is released (the lock taken by the SetupNetworks!).*

   * *The other requests removed from the queue also try to release the 
 monitoring lock*, but there is nothing to release.

   * The following warning log is printed -
 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
 (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
 exclusive lock which does not exist, lock key: 
 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'

 - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
 Why? I'm not 100% sure but I guess the late processing of the 
 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
 late + mupltiple processing of failure is root cause.


 Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is 
 

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-11 Thread Arik Hadas
On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan  wrote:

>
>
> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim  wrote:
>
>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>
>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>
>
> Seems like another bug. The migration failed since for some reason the vm
> is already defined on the destination host.
>
> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
> error=Virtual machine already exists (api:129)
> Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
> method
> ret = func(*args, **kwargs)
> File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
> raise exception.VMExists()
> VMExists: Virtual machine already exists
>
>
Milan, Francesco, could it be that because of [1] that appears on the
destination host right after shutting down the VM, it remained defined on
that host?

[1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
VM callback (clientIF:683)

Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in
dispatchLibvirtEvents

v.onLibvirtLifecycleEvent(event, detail, None)

AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'



>
>
>>
>>
>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan 
>> wrote:
>>
>>> Hi all,
>>>
>>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> responsible for the mess.
>>>
>>> -
>>> * 08:29:47 - engine loses connectivity to host 
>>> 'lago-basic-suite-4-2-host-0'.*
>>>
>>>
>>>
>>> *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>>> (unsuccessfully).*
>>>
>>>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
>>> (VdsManager,refreshImpl)
>>>
>>>  * "getCapabilitiesAsync" immediately fails and throws 
>>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
>>> exception is caught by 
>>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>>> 'onFailure' of the callback and re-throws the exception.
>>>
>>>  catch (Throwable t) {
>>> getParameters().getCallback().onFailure(t);
>>> throw t;
>>>  }
>>>
>>> * The 'onFailure' of the callback releases the "monitoringLock" 
>>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>>> lockManager.releaseLock(monitoringLock);')
>>>
>>> * 'VdsManager,refreshImpl' catches the network exception, marks 
>>> 'releaseLock = true' and *tries to release the already released lock*.
>>>
>>>   The following warning is printed to the log -
>>>
>>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>>> exclusive lock which does not exist, lock key: 
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>>
>>>
>>>
>>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>>
>>>
>>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>> *
>>>
>>> * SetupNetworks takes the monitoring lock.
>>>
>>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 
>>> 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
>>> occured.*
>>>
>>>   * When the first request is removed from the queue 
>>> ('ResponseTracker.remove()'), the
>>> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock 
>>> is released (the lock taken by the SetupNetworks!).*
>>>
>>>   * *The other requests removed from the queue also try to release the 
>>> monitoring lock*, but there is nothing to release.
>>>
>>>   * The following warning log is printed -
>>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
>>> exclusive lock which does not exist, lock key: 
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
>>> Why? I'm not 100% sure but I guess the late processing of the 
>>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
>>> late + mupltiple processing of failure is root cause.
>>>
>>>
>>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is 
>>> trying to be released three times. Please share your opinion regarding how 
>>> it should be fixed.
>>>
>>>
>>> Thanks,
>>>
>>> Alona.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg 
>>> wrote:
>>>
 On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:

>
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
>> wrote:
>>
>>> On 7 April 2018 at 00:30, Dan

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-11 Thread Alona Kaplan
On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim  wrote:

> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>
> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>

Seems like another bug. The migration failed since for some reason the vm
is already defined on the destination host.

2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
error=Virtual machine already exists (api:129)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
method
ret = func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
raise exception.VMExists()
VMExists: Virtual machine already exists




>
>
> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan  wrote:
>
>> Hi all,
>>
>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> responsible for the mess.
>>
>> -
>> * 08:29:47 - engine loses connectivity to host 
>> 'lago-basic-suite-4-2-host-0'.*
>>
>>
>>
>> *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>> (unsuccessfully).*
>>
>>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
>> (VdsManager,refreshImpl)
>>
>>  * "getCapabilitiesAsync" immediately fails and throws 
>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
>> exception is caught by 
>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>> 'onFailure' of the callback and re-throws the exception.
>>
>>  catch (Throwable t) {
>> getParameters().getCallback().onFailure(t);
>> throw t;
>>  }
>>
>> * The 'onFailure' of the callback releases the "monitoringLock" 
>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>> lockManager.releaseLock(monitoringLock);')
>>
>> * 'VdsManager,refreshImpl' catches the network exception, marks 
>> 'releaseLock = true' and *tries to release the already released lock*.
>>
>>   The following warning is printed to the log -
>>
>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>>
>>
>>
>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>
>>
>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>>
>> * SetupNetworks takes the monitoring lock.
>>
>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 
>> 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
>> occured.*
>>
>>   * When the first request is removed from the queue 
>> ('ResponseTracker.remove()'), the
>> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is 
>> released (the lock taken by the SetupNetworks!).*
>>
>>   * *The other requests removed from the queue also try to release the 
>> monitoring lock*, but there is nothing to release.
>>
>>   * The following warning log is printed -
>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
>> Why? I'm not 100% sure but I guess the late processing of the 
>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
>> late + mupltiple processing of failure is root cause.
>>
>>
>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying 
>> to be released three times. Please share your opinion regarding how it 
>> should be fixed.
>>
>>
>> Thanks,
>>
>> Alona.
>>
>>
>>
>>
>>
>>
>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:
>>
>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>>>


 On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:

> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
> wrote:
>
>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>> > No, I am afraid that we have not managed to understand why setting
>> and
>> > ipv6 address too the host off the grid. We shall continue
>> researching
>> > this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>> but
>> > could it possibly be related (I really doubt that)?
>> >
>>
>
 Sorry, but I do not see how this problem is related to VDSM.
 There is nothing that indicates that there is a VDSM problem.

 Has the RPC connection between Engine and VDSM failed?


>>> Further up the thread, Piotr noticed that (at least on one failure of
>>> this test) that the Vdsm host lost connectivity to its stor

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-11 Thread Alona Kaplan
Hi Ravi,

Added comments to the patch.

Regarding the lock - the lock shouldn't be released until the command and
its callbacks have finished. The handling of a delayed failure should be
done under the lock, since it doesn't make sense to deal with the failure
while another monitoring process is possibly running.

Besides the locking issue, IMO the main problem is the delayed failures.
If vdsm is down, why is there both an immediate exception and a delayed
one? The delayed one is redundant. In any case, 'callback.onFailure' shouldn't
be executed twice.
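
For instance (a minimal sketch only, with hypothetical names rather than the
real engine callback types), the failure handler could be made one-shot, so
that whichever path reports the failure first wins and any later, delayed
report becomes a no-op instead of releasing somebody else's lock:

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.function.Consumer;

    // Hypothetical sketch: make failure handling idempotent, so the immediate
    // ConnectException path and the delayed ResponseTracker timeout cannot
    // both run onFailure (and release the monitoring lock) for one request.
    class OneShotFailureHandler {
        private final AtomicBoolean handled = new AtomicBoolean(false);
        private final Consumer<Throwable> delegate;

        OneShotFailureHandler(Consumer<Throwable> delegate) {
            this.delegate = delegate;
        }

        void onFailure(Throwable t) {
            if (handled.compareAndSet(false, true)) {
                delegate.accept(t); // first failure report wins
            }
            // later reports (e.g. the delayed timeout) are ignored
        }
    }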

Thanks,

Alona.


On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori  wrote:

> This [1] should fix the multiple release lock issue
>
> [1] https://gerrit.ovirt.org/#/c/90077/
>
> On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori 
> wrote:
>
>> Working on a patch will post a fix
>>
>> Thanks
>>
>> Ravi
>>
>> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan 
>> wrote:
>>
>>> Hi all,
>>>
>>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> responsible for the mess.
>>>
>>> -
>>> * 08:29:47 - engine loses connectivity to host 
>>> 'lago-basic-suite-4-2-host-0'.*
>>>
>>>
>>>
>>> *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>>> (unsuccessfully).*
>>>
>>>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
>>> (VdsManager,refreshImpl)
>>>
>>>  * "getCapabilitiesAsync" immediately fails and throws 
>>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
>>> exception is caught by 
>>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>>> 'onFailure' of the callback and re-throws the exception.
>>>
>>>  catch (Throwable t) {
>>> getParameters().getCallback().onFailure(t);
>>> throw t;
>>>  }
>>>
>>> * The 'onFailure' of the callback releases the "monitoringLock" 
>>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>>> lockManager.releaseLock(monitoringLock);')
>>>
>>> * 'VdsManager,refreshImpl' catches the network exception, marks 
>>> 'releaseLock = true' and *tries to release the already released lock*.
>>>
>>>   The following warning is printed to the log -
>>>
>>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>>> exclusive lock which does not exist, lock key: 
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>>
>>>
>>>
>>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>>
>>>
>>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>> *
>>>
>>> * SetupNetworks takes the monitoring lock.
>>>
>>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 
>>> 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
>>> occured.*
>>>
>>>   * When the first request is removed from the queue 
>>> ('ResponseTracker.remove()'), the
>>> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock 
>>> is released (the lock taken by the SetupNetworks!).*
>>>
>>>   * *The other requests removed from the queue also try to release the 
>>> monitoring lock*, but there is nothing to release.
>>>
>>>   * The following warning log is printed -
>>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
>>> exclusive lock which does not exist, lock key: 
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>
>>> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
>>> Why? I'm not 100% sure but I guess the late processing of the 
>>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
>>> late + mupltiple processing of failure is root cause.
>>>
>>>
>>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is 
>>> trying to be released three times. Please share your opinion regarding how 
>>> it should be fixed.
>>>
>>>
>>> Thanks,
>>>
>>> Alona.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg 
>>> wrote:
>>>
 On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:

>
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
>> wrote:
>>
>>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>>> > No, I am afraid that we have not managed to understand why setting
>>> and
>>> > ipv6 address too the host off the grid. We shall continue
>>> researching
>>> > this next week.
>>> >
>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>> but
>>> > could it possibly be related (I really doubt that)?
>>> >
>>>
>>
> Sorry, but I do not see how this problem is related to VDSM.
> There is

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-10 Thread Ravi Shankar Nori
This [1] should fix the multiple release lock issue

[1] https://gerrit.ovirt.org/#/c/90077/

On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori  wrote:

> Working on a patch will post a fix
>
> Thanks
>
> Ravi
>
> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan  wrote:
>
>> Hi all,
>>
>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> responsible for the mess.
>>
>> -
>> * 08:29:47 - engine loses connectivity to host 
>> 'lago-basic-suite-4-2-host-0'.*
>>
>>
>>
>> *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>> (unsuccessfully).*
>>
>>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
>> (VdsManager,refreshImpl)
>>
>>  * "getCapabilitiesAsync" immediately fails and throws 
>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
>> exception is caught by 
>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>> 'onFailure' of the callback and re-throws the exception.
>>
>>  catch (Throwable t) {
>> getParameters().getCallback().onFailure(t);
>> throw t;
>>  }
>>
>> * The 'onFailure' of the callback releases the "monitoringLock" 
>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>> lockManager.releaseLock(monitoringLock);')
>>
>> * 'VdsManager,refreshImpl' catches the network exception, marks 
>> 'releaseLock = true' and *tries to release the already released lock*.
>>
>>   The following warning is printed to the log -
>>
>>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>>
>>
>>
>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>
>>
>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>>
>> * SetupNetworks takes the monitoring lock.
>>
>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 
>> 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
>> occured.*
>>
>>   * When the first request is removed from the queue 
>> ('ResponseTracker.remove()'), the
>> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is 
>> released (the lock taken by the SetupNetworks!).*
>>
>>   * *The other requests removed from the queue also try to release the 
>> monitoring lock*, but there is nothing to release.
>>
>>   * The following warning log is printed -
>> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
>> Why? I'm not 100% sure but I guess the late processing of the 
>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
>> late + mupltiple processing of failure is root cause.
>>
>>
>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying 
>> to be released three times. Please share your opinion regarding how it 
>> should be fixed.
>>
>>
>> Thanks,
>>
>> Alona.
>>
>>
>>
>>
>>
>>
>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:
>>
>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>>>


 On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:

> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
> wrote:
>
>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>> > No, I am afraid that we have not managed to understand why setting
>> and
>> > ipv6 address too the host off the grid. We shall continue
>> researching
>> > this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>> but
>> > could it possibly be related (I really doubt that)?
>> >
>>
>
 Sorry, but I do not see how this problem is related to VDSM.
 There is nothing that indicates that there is a VDSM problem.

 Has the RPC connection between Engine and VDSM failed?


>>> Further up the thread, Piotr noticed that (at least on one failure of
>>> this test) that the Vdsm host lost connectivity to its storage, and Vdsm
>>> process was restarted. However, this does not seems to happen in all cases
>>> where this test fails.
>>>
>>> ___
>>> Devel mailing list
>>> Devel@ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/devel
>>>
>>
>>
>
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-10 Thread Ravi Shankar Nori
Working on a patch will post a fix

Thanks

Ravi

On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan  wrote:

> Hi all,
>
> Looking at the log it seems that the new GetCapabilitiesAsync is
> responsible for the mess.
>
> -
> * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>
>
>
> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host 
> (unsuccessfully).*
>
>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
> (VdsManager,refreshImpl)
>
>  * "getCapabilitiesAsync" immediately fails and throws 
> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
> exception is caught by 
> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
> 'onFailure' of the callback and re-throws the exception.
>
>  catch (Throwable t) {
> getParameters().getCallback().onFailure(t);
> throw t;
>  }
>
> * The 'onFailure' of the callback releases the "monitoringLock" 
> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
> lockManager.releaseLock(monitoringLock);')
>
> * 'VdsManager,refreshImpl' catches the network exception, marks 
> 'releaseLock = true' and *tries to release the already released lock*.
>
>   The following warning is printed to the log -
>
>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
>
>
>
> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>
>
> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>
> * SetupNetworks takes the monitoring lock.
>
> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 
> minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
> occured.*
>
>   * When the first request is removed from the queue 
> ('ResponseTracker.remove()'), the
> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is 
> released (the lock taken by the SetupNetworks!).*
>
>   * *The other requests removed from the queue also try to release the 
> monitoring lock*, but there is nothing to release.
>
>   * The following warning log is printed -
> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. 
> Why? I'm not 100% sure, but I guess the late processing of 
> 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the 
> late + multiple processing of the failure are the root cause.
>
>
> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three 
> attempts to release the lock. Please share your opinion regarding how it 
> should be fixed.
>
>
> Thanks,
>
> Alona.
>
>
>
>
>
>
> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:
>
>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>>>
 Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
 Is it still failing?

 On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
 wrote:

> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
> > No, I am afraid that we have not managed to understand why setting
> and
> > ipv6 address too the host off the grid. We shall continue researching
> > this next week.
> >
> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> > could it possibly be related (I really doubt that)?
> >
>

>>> Sorry, but I do not see how this problem is related to VDSM.
>>> There is nothing that indicates that there is a VDSM problem.
>>>
>>> Has the RPC connection between Engine and VDSM failed?
>>>
>>>
>> Further up the thread, Piotr noticed that (at least on one failure of
>> this test) that the Vdsm host lost connectivity to its storage, and Vdsm
>> process was restarted. However, this does not seems to happen in all cases
>> where this test fails.
>>
>> ___
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-10 Thread Gal Ben Haim
I'm seeing the same error in [1], during 006_migrations.migrate_vm.

[1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/

On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan  wrote:

> Hi all,
>
> Looking at the log it seems that the new GetCapabilitiesAsync is
> responsible for the mess.
>
> -
> * 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>
>
>
> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host 
> (unsuccessfully).*
>
>  * before each "getCapabilitiesAsync" the monitoring lock is taken 
> (VdsManager,refreshImpl)
>
>  * "getCapabilitiesAsync" immediately fails and throws 
> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
> exception is caught by 
> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
> 'onFailure' of the callback and re-throws the exception.
>
>  catch (Throwable t) {
> getParameters().getCallback().onFailure(t);
> throw t;
>  }
>
> * The 'onFailure' of the callback releases the "monitoringLock" 
> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
> lockManager.releaseLock(monitoringLock);')
>
> * 'VdsManager,refreshImpl' catches the network exception, marks 
> 'releaseLock = true' and *tries to release the already released lock*.
>
>   The following warning is printed to the log -
>
>   WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
>
>
>
> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>
>
> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>
> * SetupNetworks takes the monitoring lock.
>
> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 
> minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
> occured.*
>
>   * When the first request is removed from the queue 
> ('ResponseTracker.remove()'), the
> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is 
> released (the lock taken by the SetupNetworks!).*
>
>   * *The other requests removed from the queue also try to release the 
> monitoring lock*, but there is nothing to release.
>
>   * The following warning log is printed -
> WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
> exclusive lock which does not exist, lock key: 
> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>
> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it started*. 
> Why? I'm not 100% sure, but I guess the late processing of 
> 'getCapabilitiesAsync', which causes the loss of the monitoring lock, and the 
> late + multiple processing of the failure are the root cause.
>
>
> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three 
> attempts to release the lock. Please share your opinion regarding how it 
> should be fixed.
>
>
> Thanks,
>
> Alona.
>
>
>
>
>
>
> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:
>
>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>>>
 Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
 Is it still failing?

 On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren 
 wrote:

> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
> > No, I am afraid that we have not managed to understand why setting
> and
> > ipv6 address too the host off the grid. We shall continue researching
> > this next week.
> >
> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> > could it possibly be related (I really doubt that)?
> >
>

>>> Sorry, but I do not see how this problem is related to VDSM.
>>> There is nothing that indicates that there is a VDSM problem.
>>>
>>> Has the RPC connection between Engine and VDSM failed?
>>>
>>>
>> Further up the thread, Piotr noticed that (at least on one failure of
>> this test) that the Vdsm host lost connectivity to its storage, and Vdsm
>> process was restarted. However, this does not seems to happen in all cases
>> where this test fails.
>>
>> ___
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
> ___
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>



-- 
*GAL bEN HAIM*
RHV DEVOPS
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-10 Thread Alona Kaplan
Hi all,

Looking at the log it seems that the new GetCapabilitiesAsync is
responsible for the mess.

-
* 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*



*- Every 3 seconds a getCapabilitiesAsync request is sent to the
host (unsuccessfully).*

 * before each "getCapabilitiesAsync" the monitoring lock is taken
(VdsManager.refreshImpl)

 * "getCapabilitiesAsync" immediately fails and throws
'VDSNetworkException: java.net.ConnectException: Connection refused'.
The exception is caught by
'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
'onFailure' of the callback and re-throws the exception.

 catch (Throwable t) {
getParameters().getCallback().onFailure(t);
throw t;
 }

* The 'onFailure' of the callback releases the "monitoringLock"
('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
lockManager.releaseLock(monitoringLock);')

* 'VdsManager.refreshImpl' catches the network exception, marks
'releaseLock = true' and *tries to release the already released lock*.

  The following warning is printed to the log -

  WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
(EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
release exclusive lock which does not exist, lock key:
'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'




*- 08:30:51 a successful getCapabilitiesAsync is sent.*


*- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*

* SetupNetworks takes the monitoring lock.

*- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
from 4 minutes ago from its queue and prints a VDSNetworkException:
Vds timeout occured.*

  * When the first request is removed from the queue
('ResponseTracker.remove()'), the
*'Callback.onFailure' is invoked (for the second time) -> monitoring
lock is released (the lock taken by the SetupNetworks!).*

  * *The other requests removed from the queue also try to release
the monitoring lock*, but there is nothing to release.

  * The following warning log is printed -
WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
(EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
release exclusive lock which does not exist, lock key:
'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'

- *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after it
started*. Why? I'm not 100% sure, but I guess the late processing of
'getCapabilitiesAsync', which causes the loss of the monitoring lock,
and the late + multiple processing of the failure are the root cause.


Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are
three attempts to release the lock. Please share your opinion regarding
how it should be fixed.
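
To make the last point concrete (a sketch only, with hypothetical names; the
real code goes through InMemoryLockManager with string lock keys), the release
could be made owner-aware, so a stale getCapabilitiesAsync timeout cannot free
the lock that SetupNetworks acquired later:

    // Hypothetical sketch: remember which operation currently owns the
    // monitoring lock and ignore a release coming from any other operation.
    class OwnedMonitoringLock {
        private String owner; // null when the lock is free

        synchronized boolean tryAcquire(String operationId) {
            if (owner != null) {
                return false;
            }
            owner = operationId;
            return true;
        }

        synchronized void release(String operationId) {
            if (operationId.equals(owner)) {
                owner = null;
            } else {
                // Already released, or held by a different operation (e.g.
                // SetupNetworks): log and ignore instead of freeing it.
                System.out.println("ignoring release from non-owner " + operationId);
            }
        }
    }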


Thanks,

Alona.






On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg  wrote:

> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:
>
>>
>>
>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>>
>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>> Is it still failing?
>>>
>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren  wrote:
>>>
 On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
 > No, I am afraid that we have not managed to understand why setting and
 > ipv6 address too the host off the grid. We shall continue researching
 > this next week.
 >
 > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
 > could it possibly be related (I really doubt that)?
 >

>>>
>> Sorry, but I do not see how this problem is related to VDSM.
>> There is nothing that indicates that there is a VDSM problem.
>>
>> Has the RPC connection between Engine and VDSM failed?
>>
>>
> Further up the thread, Piotr noticed (at least on one failure of this
> test) that the Vdsm host lost connectivity to its storage, and the Vdsm process
> was restarted. However, this does not seem to happen in all cases where
> this test fails.
>
> ___
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-08 Thread Dan Kenigsberg
On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas  wrote:

>
>
> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:
>
>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> Is it still failing?
>>
>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren  wrote:
>>
>>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>>> > No, I am afraid that we have not managed to understand why setting and
>>> > ipv6 address too the host off the grid. We shall continue researching
>>> > this next week.
>>> >
>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>> > could it possibly be related (I really doubt that)?
>>> >
>>>
>>
> Sorry, but I do not see how this problem is related to VDSM.
> There is nothing that indicates that there is a VDSM problem.
>
> Has the RPC connection between Engine and VDSM failed?
>
>
Further up the thread, Piotr noticed (at least on one failure of this
test) that the Vdsm host lost connectivity to its storage, and the Vdsm process
was restarted. However, this does not seem to happen in all cases where
this test fails.
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-07 Thread Edward Haas
On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri  wrote:

> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> Is it still failing?
>
> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren  wrote:
>
>> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
>> > No, I am afraid that we have not managed to understand why setting and
>> > ipv6 address too the host off the grid. We shall continue researching
>> > this next week.
>> >
>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>> > could it possibly be related (I really doubt that)?
>> >
>>
>
Sorry, but I do not see how this problem is related to VDSM.
There is nothing that indicates that there is a VDSM problem.

Has the RPC connection between Engine and VDSM failed?


>
>> at this point I think we should seriously consider disabling the
>> relevant test, as its impacting a large number of changes.
>>
>> > On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron  wrote:
>> >> Dan, was there a fix for the issues?
>> >> can I have a link to the fix if there was?
>> >>
>> >> Thanks,
>> >> Dafna
>> >>
>> >>
>> >> On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim 
>> wrote:
>> >>>
>> >>> From lago's log, I see that lago collected the logs from the VMs
>> using ssh
>> >>> (after the test failed), which means
>> >>> that the VM didn't crash.
>> >>>
>> >>> On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg 
>> wrote:
>> 
>>  On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren 
>> wrote:
>>  > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
>>  >
>>  > Link to suspected patches:
>>  > (Probably unrelated)
>>  > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) -
>> examples:
>>  > export template to an export domain
>>  >
>>  > This seems to happen multiple times sporadically, I thought this
>> would
>>  > be solved by
>>  > https://gerrit.ovirt.org/#/c/89781/ but it isn't.
>> 
>>  right, it is a completely unrelated issue there (with external
>> networks).
>>  here, however, the host dies while setting setupNetworks of an ipv6
>>  address. Setup network waits for Engine's confirmation at
>> 08:33:00,711
>> 
>>  http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/lago-basic-
>> suite-4-2-host-0/_var_log/vdsm/supervdsm.log
>>  but kernel messages stop at 08:33:23
>> 
>>  http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/lago-basic-
>> suite-4-2-host-0/_var_log/messages/*view*/
>> 
>>  Does the lago VM of this host crash? pause?
>> 
>> 
>>  >
>>  > Link to Job:
>>  > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
>>  >
>>  > Link to all logs:
>>  >
>>  > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/
>>  >
>>  > Error snippet from log:
>>  >
>>  > 
>>  >
>>  > Traceback (most recent call last):
>>  >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>  > testMethod()
>>  >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197,
>> in
>>  > runTest
>>  > self.test(*self.arg)
>>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
>> line
>>  > 129, in wrapped_test
>>  > test()
>>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
>> line
>>  > 59, in wrapper
>>  > return func(get_test_prefix(), *args, **kwargs)
>>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
>> line
>>  > 78, in wrapper
>>  > prefix.virt_env.engine_vm().get_api(api_ver=4), *args,
>> **kwargs
>>  >   File
>>  > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt
>> -system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
>>  > line 139, in prepare_migration_attachments_ipv6
>>  > engine, host_service, MIGRATION_NETWORK, ip_configuration)
>>  >   File
>>  > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt
>> -system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
>>  > line 71, in modify_ip_config
>>  > check_connectivity=True)
>>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
>>  > line 36729, in setup_networks
>>  > return self._internal_action(action, 'setupnetworks', None,
>>  > headers, query, wait)
>>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
>> line
>>  > 299, in _internal_action
>>  > return future.wait() if wait else future
>>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
>> line
>>  > 55, in wait
>>  > return self.

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-07 Thread Eyal Edri
Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
Is it still failing?

On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren  wrote:

> On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
> > No, I am afraid that we have not managed to understand why setting an
> > ipv6 address took the host off the grid. We shall continue researching
> > this next week.
> >
> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> > could it possibly be related (I really doubt that)?
> >
>
> at this point I think we should seriously consider disabling the
> relevant test, as it's impacting a large number of changes.
>
> > On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron  wrote:
> >> Dan, was there a fix for the issues?
> >> can I have a link to the fix if there was?
> >>
> >> Thanks,
> >> Dafna
> >>
> >>
> >> On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim 
> wrote:
> >>>
> >>> From lago's log, I see that lago collected the logs from the VMs using
> ssh
> >>> (after the test failed), which means
> >>> that the VM didn't crash.
> >>>
> >>> On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg 
> wrote:
> 
>  On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren 
> wrote:
>  > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
>  >
>  > Link to suspected patches:
>  > (Probably unrelated)
>  > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
>  > export template to an export domain
>  >
>  > This seems to happen multiple times sporadically, I thought this
> would
>  > be solved by
>  > https://gerrit.ovirt.org/#/c/89781/ but it isn't.
> 
>  right, it is a completely unrelated issue there (with external
> networks).
>  here, however, the host dies while setupNetworks is applying an ipv6
>  address. Setup network waits for Engine's confirmation at 08:33:00,711
> 
>  http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/lago-
> basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log
>  but kernel messages stop at 08:33:23
> 
>  http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/lago-
> basic-suite-4-2-host-0/_var_log/messages/*view*/
> 
>  Does the lago VM of this host crash? pause?
> 
> 
>  >
>  > Link to Job:
>  > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
>  >
>  > Link to all logs:
>  >
>  > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/
>  >
>  > Error snippet from log:
>  >
>  > 
>  >
>  > Traceback (most recent call last):
>  >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>  > testMethod()
>  >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197,
> in
>  > runTest
>  > self.test(*self.arg)
>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
> line
>  > 129, in wrapped_test
>  > test()
>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
> line
>  > 59, in wrapper
>  > return func(get_test_prefix(), *args, **kwargs)
>  >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py",
> line
>  > 78, in wrapper
>  > prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
>  >   File
>  > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/
> ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
>  > line 139, in prepare_migration_attachments_ipv6
>  > engine, host_service, MIGRATION_NETWORK, ip_configuration)
>  >   File
>  > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/
> ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
>  > line 71, in modify_ip_config
>  > check_connectivity=True)
>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
>  > line 36729, in setup_networks
>  > return self._internal_action(action, 'setupnetworks', None,
>  > headers, query, wait)
>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
> line
>  > 299, in _internal_action
>  > return future.wait() if wait else future
>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
> line
>  > 55, in wait
>  > return self._code(response)
>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
> line
>  > 296, in callback
>  > self._check_fault(response)
>  >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py",
> line
>  > 132, in _check_fault
>  > self._raise_error(response, body)
>  >   File "/usr/lib64/python2.7/site-packa

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-07 Thread Barak Korren
On 7 April 2018 at 00:30, Dan Kenigsberg  wrote:
> No, I am afraid that we have not managed to understand why setting an
> ipv6 address took the host off the grid. We shall continue researching
> this next week.
>
> Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> could it possibly be related (I really doubt that)?
>

at this point I think we should seriously consider disabling the
relevant test, as it's impacting a large number of changes.

> On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron  wrote:
>> Dan, was there a fix for the issues?
>> can I have a link to the fix if there was?
>>
>> Thanks,
>> Dafna
>>
>>
>> On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim  wrote:
>>>
>>> From lago's log, I see that lago collected the logs from the VMs using ssh
>>> (after the test failed), which means
>>> that the VM didn't crash.
>>>
>>> On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg  wrote:

 On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren  wrote:
 > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
 >
 > Link to suspected patches:
 > (Probably unrelated)
 > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
 > export template to an export domain
 >
 > This seems to happen multiple times sporadically, I thought this would
 > be solved by
 > https://gerrit.ovirt.org/#/c/89781/ but it isn't.

 right, it is a completely unrelated issue there (with external networks).
 here, however, the host dies while setupNetworks is applying an ipv6
 address. Setup network waits for Engine's confirmation at 08:33:00,711

 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log
 but kernel messages stop at 08:33:23

 http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/

 Does the lago VM of this host crash? pause?


 >
 > Link to Job:
 > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
 >
 > Link to all logs:
 >
 > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
 >
 > Error snippet from log:
 >
 > 
 >
 > Traceback (most recent call last):
 >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
 > testMethod()
 >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in
 > runTest
 > self.test(*self.arg)
 >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
 > 129, in wrapped_test
 > test()
 >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
 > 59, in wrapper
 > return func(get_test_prefix(), *args, **kwargs)
 >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
 > 78, in wrapper
 > prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
 >   File
 > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
 > line 139, in prepare_migration_attachments_ipv6
 > engine, host_service, MIGRATION_NETWORK, ip_configuration)
 >   File
 > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
 > line 71, in modify_ip_config
 > check_connectivity=True)
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
 > line 36729, in setup_networks
 > return self._internal_action(action, 'setupnetworks', None,
 > headers, query, wait)
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
 > 299, in _internal_action
 > return future.wait() if wait else future
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
 > 55, in wait
 > return self._code(response)
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
 > 296, in callback
 > self._check_fault(response)
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
 > 132, in _check_fault
 > self._raise_error(response, body)
 >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
 > 118, in _raise_error
 > raise error
 > Error: Fault reason is "Operation Failed". Fault detail is "[Network
 > error during communication with the Host.]". HTTP response code is
 > 400.
 >
 >
 >
 > 
 >
 >
 >
 > --
 > Barak Korren
 > RHV DevOps team , RHCE, RHCi
 > Red Hat EMEA
 > redhat.com | TRI

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-06 Thread Dan Kenigsberg
No, I am afraid that we have not managed to understand why setting an
ipv6 address took the host off the grid. We shall continue researching
this next week.

Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
could it possibly be related (I really doubt that)?

On Fri, Apr 6, 2018 at 2:20 PM, Dafna Ron  wrote:
> Dan, was there a fix for the issues?
> can I have a link to the fix if there was?
>
> Thanks,
> Dafna
>
>
> On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim  wrote:
>>
>> From lago's log, I see that lago collected the logs from the VMs using ssh
>> (after the test failed), which means
>> that the VM didn't crash.
>>
>> On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg  wrote:
>>>
>>> On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren  wrote:
>>> > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
>>> >
>>> > Link to suspected patches:
>>> > (Probably unrelated)
>>> > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
>>> > export template to an export domain
>>> >
>>> > This seems to happen multiple times sporadically, I thought this would
>>> > be solved by
>>> > https://gerrit.ovirt.org/#/c/89781/ but it isn't.
>>>
>>> right, it is a completely unrelated issue there (with external networks).
>>> here, however, the host dies while setupNetworks is applying an ipv6
>>> address. Setup network waits for Engine's confirmation at 08:33:00,711
>>>
>>> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log
>>> but kernel messages stop at 08:33:23
>>>
>>> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/
>>>
>>> Does the lago VM of this host crash? pause?
>>>
>>>
>>> >
>>> > Link to Job:
>>> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
>>> >
>>> > Link to all logs:
>>> >
>>> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
>>> >
>>> > Error snippet from log:
>>> >
>>> > 
>>> >
>>> > Traceback (most recent call last):
>>> >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>>> > testMethod()
>>> >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in
>>> > runTest
>>> > self.test(*self.arg)
>>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>>> > 129, in wrapped_test
>>> > test()
>>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>>> > 59, in wrapper
>>> > return func(get_test_prefix(), *args, **kwargs)
>>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>>> > 78, in wrapper
>>> > prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
>>> >   File
>>> > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
>>> > line 139, in prepare_migration_attachments_ipv6
>>> > engine, host_service, MIGRATION_NETWORK, ip_configuration)
>>> >   File
>>> > "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
>>> > line 71, in modify_ip_config
>>> > check_connectivity=True)
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
>>> > line 36729, in setup_networks
>>> > return self._internal_action(action, 'setupnetworks', None,
>>> > headers, query, wait)
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>>> > 299, in _internal_action
>>> > return future.wait() if wait else future
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>>> > 55, in wait
>>> > return self._code(response)
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>>> > 296, in callback
>>> > self._check_fault(response)
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>>> > 132, in _check_fault
>>> > self._raise_error(response, body)
>>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>>> > 118, in _raise_error
>>> > raise error
>>> > Error: Fault reason is "Operation Failed". Fault detail is "[Network
>>> > error during communication with the Host.]". HTTP response code is
>>> > 400.
>>> >
>>> >
>>> >
>>> > 
>>> >
>>> >
>>> >
>>> > --
>>> > Barak Korren
>>> > RHV DevOps team , RHCE, RHCi
>>> > Red Hat EMEA
>>> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>>> > ___
>>> > Devel mailing list
>>> > Devel@ovirt.org
>>> > http://lists.ovirt.org/mailman/listinfo/devel
>>> >
>>> >
>>> ___
>>> Devel mailing list
>>> Devel@ovirt

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-06 Thread Dafna Ron
Dan, was there a fix for the issues?
can I have a link to the fix if there was?

Thanks,
Dafna


On Wed, Apr 4, 2018 at 5:01 PM, Gal Ben Haim  wrote:

> From lago's log, I see that lago collected the logs from the VMs using ssh
> (after the test failed), which means
> that the VM didn't crash.
>
> On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg  wrote:
>
>> On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren  wrote:
>> > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
>> >
>> > Link to suspected patches:
>> > (Probably unrelated)
>> > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
>> > export template to an export domain
>> >
>> > This seems to happen multiple times sporadically, I thought this would
>> > be solved by
>> > https://gerrit.ovirt.org/#/c/89781/ but it isn't.
>>
>> right, it is a completely unrelated issue there (with external networks).
>> here, however, the host dies while setupNetworks is applying an ipv6
>> address. Setup network waits for Engine's confirmation at 08:33:00,711
>> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/lago-basic-
>> suite-4-2-host-0/_var_log/vdsm/supervdsm.log
>> but kernel messages stop at 08:33:23
>> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/lago-basic-
>> suite-4-2-host-0/_var_log/messages/*view*/
>>
>> Does the lago VM of this host crash? pause?
>>
>>
>> >
>> > Link to Job:
>> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
>> >
>> > Link to all logs:
>> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1
>> 537/artifact/exported-artifacts/basic-suit-4.2-el7/test_
>> logs/basic-suite-4.2/post-006_migrations.py/
>> >
>> > Error snippet from log:
>> >
>> > 
>> >
>> > Traceback (most recent call last):
>> >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
>> > testMethod()
>> >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in
>> runTest
>> > self.test(*self.arg)
>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>> > 129, in wrapped_test
>> > test()
>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>> > 59, in wrapper
>> > return func(get_test_prefix(), *args, **kwargs)
>> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
>> > 78, in wrapper
>> > prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
>> >   File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt
>> -system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
>> > line 139, in prepare_migration_attachments_ipv6
>> > engine, host_service, MIGRATION_NETWORK, ip_configuration)
>> >   File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt
>> -system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
>> > line 71, in modify_ip_config
>> > check_connectivity=True)
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
>> > line 36729, in setup_networks
>> > return self._internal_action(action, 'setupnetworks', None,
>> > headers, query, wait)
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>> > 299, in _internal_action
>> > return future.wait() if wait else future
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>> > 55, in wait
>> > return self._code(response)
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>> > 296, in callback
>> > self._check_fault(response)
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>> > 132, in _check_fault
>> > self._raise_error(response, body)
>> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
>> > 118, in _raise_error
>> > raise error
>> > Error: Fault reason is "Operation Failed". Fault detail is "[Network
>> > error during communication with the Host.]". HTTP response code is
>> > 400.
>> >
>> >
>> >
>> > 
>> >
>> >
>> >
>> > --
>> > Barak Korren
>> > RHV DevOps team , RHCE, RHCi
>> > Red Hat EMEA
>> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>> > ___
>> > Devel mailing list
>> > Devel@ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/devel
>> >
>> >
>> ___
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
>
> --
> *GAL bEN HAIM*
> RHV DEVOPS
>
> ___
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-04 Thread Gal Ben Haim
From lago's log, I see that lago collected the logs from the VMs using ssh
(after the test failed), which means
that the VM didn't crash.

On Wed, Apr 4, 2018 at 5:27 PM, Dan Kenigsberg  wrote:

> On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren  wrote:
> > Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
> >
> > Link to suspected patches:
> > (Probably unrelated)
> > https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
> > export template to an export domain
> >
> > This seems to happen multiple times sporadically, I thought this would
> > be solved by
> > https://gerrit.ovirt.org/#/c/89781/ but it isn't.
>
> right, it is a completely unrelated issue there (with external networks).
> here, however, the host dies while setupNetworks is applying an ipv6
> address. Setup network waits for Engine's confirmation at 08:33:00,711
> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/lago-
> basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log
> but kernel messages stop at 08:33:23
> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/lago-
> basic-suite-4-2-host-0/_var_log/messages/*view*/
>
> Does the lago VM of this host crash? pause?
>
>
> >
> > Link to Job:
> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
> >
> > Link to all logs:
> > http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/
> 1537/artifact/exported-artifacts/basic-suit-4.2-el7/
> test_logs/basic-suite-4.2/post-006_migrations.py/
> >
> > Error snippet from log:
> >
> > 
> >
> > Traceback (most recent call last):
> >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> > testMethod()
> >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in
> runTest
> > self.test(*self.arg)
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> > 129, in wrapped_test
> > test()
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> > 59, in wrapper
> > return func(get_test_prefix(), *args, **kwargs)
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> > 78, in wrapper
> > prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
> >   File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/
> ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
> > line 139, in prepare_migration_attachments_ipv6
> > engine, host_service, MIGRATION_NETWORK, ip_configuration)
> >   File "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/
> ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
> > line 71, in modify_ip_config
> > check_connectivity=True)
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
> > line 36729, in setup_networks
> > return self._internal_action(action, 'setupnetworks', None,
> > headers, query, wait)
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> > 299, in _internal_action
> > return future.wait() if wait else future
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> > 55, in wait
> > return self._code(response)
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> > 296, in callback
> > self._check_fault(response)
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> > 132, in _check_fault
> > self._raise_error(response, body)
> >   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> > 118, in _raise_error
> > raise error
> > Error: Fault reason is "Operation Failed". Fault detail is "[Network
> > error during communication with the Host.]". HTTP response code is
> > 400.
> >
> >
> >
> > 
> >
> >
> >
> > --
> > Barak Korren
> > RHV DevOps team , RHCE, RHCi
> > Red Hat EMEA
> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> > ___
> > Devel mailing list
> > Devel@ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/devel
> >
> >
> ___
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>



-- 
*GAL bEN HAIM*
RHV DEVOPS
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel
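For whoever picks this up next, a small helper in the spirit of the correlation Dan describes above (purely illustrative; it assumes the supervdsm.log and messages files from the job artifacts were downloaded into the current directory under those names):

# Hedged sketch: print the last setupNetworks line seen by supervdsm and
# the last few kernel/messages lines, to spot where the host went silent.
# File names are assumptions based on the artifact links quoted above.
last_setup = None
with open('supervdsm.log') as f:
    for line in f:
        if 'setupNetworks' in line:
            last_setup = line
print('last setupNetworks entry:', (last_setup or '<none found>').rstrip())

with open('messages') as f:
    tail = f.readlines()[-5:]
print('last kernel/messages entries:')
for line in tail:
    print('  ', line.rstrip())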

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

2018-04-04 Thread Dan Kenigsberg
On Wed, Apr 4, 2018 at 4:59 PM, Barak Korren  wrote:
> Test failed: [ 006_migrations.prepare_migration_attachments_ipv6 ]
>
> Link to suspected patches:
> (Probably unrelated)
> https://gerrit.ovirt.org/#/c/89812/1 (ovirt-engine-sdk) - examples:
> export template to an export domain
>
> This seems to happen multiple times sporadically, I thought this would
> be solved by
> https://gerrit.ovirt.org/#/c/89781/ but it isn't.

right, it is a completely unrelated issue there (with external networks).
here, however, the host dies while setupNetworks is applying an ipv6
address. Setup network waits for Engine's confirmation at 08:33:00,711
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/vdsm/supervdsm.log
but kernel messages stop at 08:33:23
http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/lago-basic-suite-4-2-host-0/_var_log/messages/*view*/

Does the lago VM of this host crash? pause?


>
> Link to Job:
> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/
>
> Link to all logs:
> http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1537/artifact/exported-artifacts/basic-suit-4.2-el7/test_logs/basic-suite-4.2/post-006_migrations.py/
>
> Error snippet from log:
>
> 
>
> Traceback (most recent call last):
>   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> testMethod()
>   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
> self.test(*self.arg)
>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> 129, in wrapped_test
> test()
>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> 59, in wrapper
> return func(get_test_prefix(), *args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
> 78, in wrapper
> prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
>   File 
> "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test-scenarios/006_migrations.py",
> line 139, in prepare_migration_attachments_ipv6
> engine, host_service, MIGRATION_NETWORK, ip_configuration)
>   File 
> "/home/jenkins/workspace/ovirt-4.2_change-queue-tester/ovirt-system-tests/basic-suite-4.2/test_utils/network_utils_v4.py",
> line 71, in modify_ip_config
> check_connectivity=True)
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py",
> line 36729, in setup_networks
> return self._internal_action(action, 'setupnetworks', None,
> headers, query, wait)
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> 299, in _internal_action
> return future.wait() if wait else future
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> 55, in wait
> return self._code(response)
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> 296, in callback
> self._check_fault(response)
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> 132, in _check_fault
> self._raise_error(response, body)
>   File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line
> 118, in _raise_error
> raise error
> Error: Fault reason is "Operation Failed". Fault detail is "[Network
> error during communication with the Host.]". HTTP response code is
> 400.
>
>
>
> 
>
>
>
> --
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> ___
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>
>
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel
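For anyone who wants to poke at the failing step outside of OST, this is roughly the SDK call chain visible in the traceback above (a hedged sketch: the engine URL, credentials, host search, network name and the static IPv6 values are placeholders, not the suite's real configuration; only the setup_networks(..., check_connectivity=True) shape is taken from the failing test):

# Hedged reconstruction of the ovirtsdk4 calls behind
# prepare_migration_attachments_ipv6 / modify_ip_config; every literal
# below (URL, credentials, host search, network name, addresses) is a
# placeholder, and the attachment body is a plausible reconstruction
# rather than code copied from the suite.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search='name=lago-basic-suite-4-2-host-0')[0]
host_service = hosts_service.host_service(host.id)

host_service.setup_networks(
    modified_network_attachments=[
        types.NetworkAttachment(
            network=types.Network(name='migration-net'),  # MIGRATION_NETWORK placeholder
            ip_address_assignments=[
                types.IpAddressAssignment(
                    assignment_method=types.BootProtocol.STATIC,
                    ip=types.Ip(
                        address='fd00:1234::10',  # placeholder static IPv6
                        netmask='64',
                        version=types.IpVersion.V6,
                    ),
                ),
            ],
        ),
    ],
    check_connectivity=True,  # the 400 "Network error" fault surfaces here
)
connection.close()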