Re: DocumentRDB builds time out on Jenkins

2015-12-11 Thread Michael Dürig



On 10.12.15 8:08 , Julian Reschke wrote:

On 2015-12-10 15:58, Julian Reschke wrote:

On 2015-12-10 15:22, Marcel Reutegger wrote:

Hi,

hmm, this could be similar to
https://issues.apache.org/jira/browse/OAK-2704

I had a closer look at the DOCUMENT_RDB fixture and I think
the problem is with the createNodeStore() method. The method
is supposed to create independent node store instances, but
the RDB implementation uses an instance field for the data source.
this means when called concurrently one store may point to
the data source created by another thread.
...


That per se wouldn't be a problem. But the dispose() method tries to
close the DataSource, which of course might be different from the one
that was used to create the NodeStore (will file an issue and fix that).


Done in the meantime.


That again wouldn't be a problem if I hadn't fixed OAK-3722 last week.
When did these timeouts start to appear?

Anyway, that still doesn't explain the confusion in the surefire logs,
no?


And build 611
() is
going to timeout again (this time for the RDBMK JDK 1.7 integration tests).



No it didn't: 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/611/jdk=latest1.7,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/


All RDB builds consistently ran at around 0.5h for builds 611 and 615. 
This looks much better now.


Michael



Re: DocumentRDB builds time out on Jenkins

2015-12-11 Thread Julian Reschke

On 2015-12-11 16:07, Michael Dürig wrote:

...

And build 611
()
is
going to timeout again (this time for the RDBMK JDK 1.7 integration
tests).



No it didn't:
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/611/jdk=latest1.7,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/


Right, I was too pessimistic.


All RDB builds consistently ran at around 0.5h for builds 611 and 615.
This looks much better now.


But they didn't in between. So maybe we solved just one of multiple 
problems.


Best regards, Julian




Re: DocumentRDB builds time out on Jenkins

2015-12-11 Thread Michael Dürig



On 11.12.15 4:19 , Julian Reschke wrote:

On 2015-12-11 16:07, Michael Dürig wrote:

...

And build 611
()

is
going to timeout again (this time for the RDBMK JDK 1.7 integration
tests).



No it didn't:
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/611/jdk=latest1.7,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/



Right, I was too pessimistic.


All RDB builds consistently ran at around 0.5h for builds 611 and 615.
This looks much better now.


But they didn't in between. So maybe we solved just one of multiple
problems.


Rats, I didn't see 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/614/


I was too optimistic :(

Michael


Best regards, Julian




Re: DocumentRDB builds time out on Jenkins

2015-12-11 Thread Tomek Rekawek
Hi,



On 10/12/15 15:58, "Julian Reschke"  wrote:

>Anyway, that still doesn't explain the confusion in the surefire logs, no?

I think I found a cause of this one. The custom Parallelized runner we use has 
a bug - if a test class runs for longer than 10 minutes, it will stop waiting 
for it and run the next class (while the previous is still running in the 
background). I’ve described this in the comment to OAK-3743 [1] and attached a 
patch.

This seems like a side-effect of the RepositoryTest running very slow, but not 
the main cause of the performance issues in Jenkins.

Best regards,
Tomek

[1] https://issues.apache.org/jira/browse/OAK-3743


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Julian Reschke

On 2015-12-10 15:58, Julian Reschke wrote:

On 2015-12-10 15:22, Marcel Reutegger wrote:

Hi,

hmm, this could be similar to
https://issues.apache.org/jira/browse/OAK-2704

I had a closer look at the DOCUMENT_RDB fixture and I think
the problem is with the createNodeStore() method. The method
is supposed to create independent node store instances, but
the RDB implementation uses an instance field for the data source.
this means when called concurrently one store may point to
the data source created by another thread.
...


That per se wouldn't be a problem. But the dispose() method tries to
close the DataSource, which of course might be different from the one
that was used to create the NodeStore (will file an issue and fix that).


Done in the meantime.


That again wouldn't be a problem if I hadn't fixed OAK-3722 last week.
When did these timeouts start to appear?

Anyway, that still doesn't explain the confusion in the surefire logs, no?


And build 611 
() is 
going to timeout again (this time for the RDBMK JDK 1.7 integration tests).


Best regards, Julian


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Michael Dürig



On 10.12.15 11:54 , Michael Dürig wrote:



On 10.12.15 11:40 , Tomek Rekawek wrote:

Bertrand already pointed out the appropriate option. Passing
**/target/unit-tests.log as the “Files to archive” should do the trick.


Ok, I added a post build action for archiving **/target/unit-tests.log.
Let's see whether this works.


Seems to work: 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/607/jdk=jdk-1.6u45,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=unittesting/


Michael


Michael


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Bertrand Delacretaz
On Thu, Dec 10, 2015 at 11:27 AM, Michael Dürig  wrote:
> On 10.12.15 11:02 , Tomek Rekawek wrote:
>> ...Maybe we can change the configuration, so the
>> target/unittest.log file will be archived if the job is failed?...
>
> This would be useful anyway but I think so far no one has figured out how to
> do it

On Jenkins, "Archive the artifacts" in "Post-build Actions" should do it.

-Bertrand


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Michael Dürig



On 10.12.15 11:02 , Tomek Rekawek wrote:

I’ve noticed that the failed builds can be assigned to a few
machines. For instance, builds on ubuntu-6, ubuntu-5, H11 and H10
fails almost every time. On the other hand, all builds on the
jenkins-ubuntu-1404-4gb succeedes.


Good find! Are these machines slower in general (i.e. for all builds) or 
only for the DocumentRDB builds? In the former case it might make sense 
to just increase the time out.




For the further diagnostics it could be useful to get the
unittest.log file from the failed build. Right now, the only way to
do it is to wait until some build fails and grab the file from the
workspace. The workspace is purged every time the jobs starts, so it
requires a good timing. Maybe we can change the configuration, so the
target/unittest.log file will be archived if the job is failed?


This would be useful anyway but I think so far no one has figured out 
how to do it. Unfortunately I can only grant committers access to that 
Jenkins. Maybe Tommaso could help at this point as he was involved in 
setting that instance up.


Michael



Best regards, Tomek




On 09/12/15 17:20, "Michael Dürig"  wrote:




On 9.12.15 12:02 , Michael Dürig wrote:




Do you have any suggestions?


Something must be way wrong as the build either takes ~30 minutes
or it times out after 90 minutes. There is quite a difference
here. Maybe we need to go through the log files and compare test
execution times of individual tests between timed out and
succeeded builds to get a clue.



Comparing test execution times for builds 593 and 595. The former
timed out and the latter passed. Both ran on Java 1.8.

Running org.apache.jackrabbit.oak.jcr.ValueFactoryTest Tests run:
2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.774 sec

vs.

Running org.apache.jackrabbit.oak.jcr.MoveRemoveTest Tests run: 12,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.893 sec


Running org.apache.jackrabbit.oak.jcr.query.QueryTest Tests run:
27, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 541.876 sec

vs.

Running org.apache.jackrabbit.oak.jcr.query.QueryTest Tests run:
27, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 48.32 sec

So it seems that for some reason sometimes tests run 3 - 5 times
slower than normal, which with a usual build time of ~30 minutes
bumps us over the 90 minutes build time out.

Michael


See:
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/595/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/consoleFull



https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/593/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/console


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Julian Reschke

On 2015-12-10 14:57, Tomek Rekawek wrote:

Hi,

I spent some time analysing the logs and I found out a strange thing. On a 
“slow” machine, in the surefire logs for the AtomicCounterTest (which takes 63 
sec while it should 3 sec), following test case appears [1]:

   


It’s a test case from a completely different class. I downloded all the 
surefire reports [2] and it seems that the test cases from the RepositoryTest 
class are spread across CRUDTest, ConflictResolutionTest, ObservationTest and 
others. Tests from the RepositoryTest class are executed in a different thread 
(pool-29-thread-1 in [3]). On the “fast” machines the problem doesn’t exist and 
all RepositoryTest methods are invoked within their own test case.

I don’t know yet what’s the cause of this, but will dig deeper.

Best regards,
Tomek


Wow. Great digging.

Maybe one thing to do first is to upgrade surefire to the latest version?

Best regards, Julian



Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Tomek Rekawek
Hi,

I spent some time analysing the logs and I found out a strange thing. On a 
“slow” machine, in the surefire logs for the AtomicCounterTest (which takes 63 
sec while it should 3 sec), following test case appears [1]:

  


It’s a test case from a completely different class. I downloded all the 
surefire reports [2] and it seems that the test cases from the RepositoryTest 
class are spread across CRUDTest, ConflictResolutionTest, ObservationTest and 
others. Tests from the RepositoryTest class are executed in a different thread 
(pool-29-thread-1 in [3]). On the “fast” machines the problem doesn’t exist and 
all RepositoryTest methods are invoked within their own test case.

I don’t know yet what’s the cause of this, but will dig deeper.

Best regards,
Tomek

[1] https://gist.github.com/trekawek/30311820122b139a6e44
[2] http://adobe.ly/1Rcao3V

[3] 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/jdk=latest1.7,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/608/artifact/oak-jcr/target/unit-tests.log

On 10/12/15 12:44, "Michael Dürig"  wrote:
>>
>> Ok, I added a post build action for archiving **/target/unit-tests.log.
>> Let's see whether this works.
>
>Seems to work: 
>https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/607/jdk=jdk-1.6u45,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=unittesting/


Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Marcel Reutegger
Hi,

hmm, this could be similar to
https://issues.apache.org/jira/browse/OAK-2704

I had a closer look at the DOCUMENT_RDB fixture and I think
the problem is with the createNodeStore() method. The method
is supposed to create independent node store instances, but
the RDB implementation uses an instance field for the data source.
this means when called concurrently one store may point to
the data source created by another thread.

Regards
 Marcel

On 10/12/15 14:57, "Tomek Rekawek" wrote:
>Hi,
>
>I spent some time analysing the logs and I found out a strange thing. On
>a ³slow² machine, in the surefire logs for the AtomicCounterTest (which
>takes 63 sec while it should 3 sec), following test case appears [1]:
>
>  classname="org.apache.jackrabbit.oak.jcr.RepositoryTest"
>name="importWithRegisteredType[RDBDocumentStore on
>jdbc:derby:oaktest\;create=true]"/>
>
>
>It¹s a test case from a completely different class. I downloded all the
>surefire reports [2] and it seems that the test cases from the
>RepositoryTest class are spread across CRUDTest, ConflictResolutionTest,
>ObservationTest and others. Tests from the RepositoryTest class are
>executed in a different thread (pool-29-thread-1 in [3]). On the ³fast²
>machines the problem doesn¹t exist and all RepositoryTest methods are
>invoked within their own test case.
>
>I don¹t know yet what¹s the cause of this, but will dig deeper.
>
>Best regards,
>Tomek
>
>[1] https://gist.github.com/trekawek/30311820122b139a6e44
>[2] http://adobe.ly/1Rcao3V
>
>[3] 
>https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/jdk=lates
>t1.7,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/608/artifact
>/oak-jcr/target/unit-tests.log
>
>On 10/12/15 12:44, "Michael Dürig"  wrote:
>>>
>>> Ok, I added a post build action for archiving **/target/unit-tests.log.
>>> Let's see whether this works.
>>
>>Seems to work: 
>>https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/607/jdk=
>>jdk-1.6u45,label=Ubuntu,nsfixtures=SEGMENT_MK,profile=unittesting/



Re: DocumentRDB builds time out on Jenkins

2015-12-10 Thread Julian Reschke

On 2015-12-10 15:22, Marcel Reutegger wrote:

Hi,

hmm, this could be similar to
https://issues.apache.org/jira/browse/OAK-2704

I had a closer look at the DOCUMENT_RDB fixture and I think
the problem is with the createNodeStore() method. The method
is supposed to create independent node store instances, but
the RDB implementation uses an instance field for the data source.
this means when called concurrently one store may point to
the data source created by another thread.
...


That per se wouldn't be a problem. But the dispose() method tries to 
close the DataSource, which of course might be different from the one 
that was used to create the NodeStore (will file an issue and fix that).


That again wouldn't be a problem if I hadn't fixed OAK-3722 last week. 
When did these timeouts start to appear?


Anyway, that still doesn't explain the confusion in the surefire logs, no?

Best regards, Julian






Re: DocumentRDB builds time out on Jenkins

2015-12-09 Thread Michael Dürig



On 9.12.15 11:58 , Julian Reschke wrote:

On 2015-12-09 09:42, Michael Dürig wrote:


Hi,

In the last 5 build on our Jenkins instance [1] we always had one of the
DocumentRDB jobs timing out [2]. This means the job didn't complete
within 90 minutes. This is not because we are close to the limit as
regular builds take roughly 30 minutes.

Is there anything we could do in the short term so we get at least
useful results back from Jenkins again?

Michael

[1] https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/
[2] https://issues.apache.org/jira/browse/OAK-3743



If it was always the same test then we should dig deeper. But as far as
I can tell, it isn't.


No it isn't a single test where it times out.



Do you have any suggestions?


Something must be way wrong as the build either takes ~30 minutes or it 
times out after 90 minutes. There is quite a difference here. Maybe we 
need to go through the log files and compare test execution times of 
individual tests between timed out and succeeded builds to get a clue.


Michael



Best regards, Julian



Re: DocumentRDB builds time out on Jenkins

2015-12-09 Thread Julian Reschke

On 2015-12-09 09:42, Michael Dürig wrote:


Hi,

In the last 5 build on our Jenkins instance [1] we always had one of the
DocumentRDB jobs timing out [2]. This means the job didn't complete
within 90 minutes. This is not because we are close to the limit as
regular builds take roughly 30 minutes.

Is there anything we could do in the short term so we get at least
useful results back from Jenkins again?

Michael

[1] https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/
[2] https://issues.apache.org/jira/browse/OAK-3743



If it was always the same test then we should dig deeper. But as far as 
I can tell, it isn't.


Do you have any suggestions?

Best regards, Julian



Re: DocumentRDB builds time out on Jenkins

2015-12-09 Thread Michael Dürig



On 9.12.15 12:41 , Julian Reschke wrote:


Something must be way wrong as the build either takes ~30 minutes or it
times out after 90 minutes. There is quite a difference here. Maybe we
need to go through the log files and compare test execution times of
individual tests between timed out and succeeded builds to get a clue.
...


Any chance we could get Jenkins to get a thread dump when aborting the VM?


I don't think so. But maybe we can hack something in somehow on our side?

Michael



Best regards, Julian


Re: DocumentRDB builds time out on Jenkins

2015-12-09 Thread Julian Reschke

On 2015-12-09 12:02, Michael Dürig wrote:



On 9.12.15 11:58 , Julian Reschke wrote:

On 2015-12-09 09:42, Michael Dürig wrote:


Hi,

In the last 5 build on our Jenkins instance [1] we always had one of the
DocumentRDB jobs timing out [2]. This means the job didn't complete
within 90 minutes. This is not because we are close to the limit as
regular builds take roughly 30 minutes.

Is there anything we could do in the short term so we get at least
useful results back from Jenkins again?

Michael

[1] https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/
[2] https://issues.apache.org/jira/browse/OAK-3743



If it was always the same test then we should dig deeper. But as far as
I can tell, it isn't.


No it isn't a single test where it times out.



Do you have any suggestions?


Something must be way wrong as the build either takes ~30 minutes or it
times out after 90 minutes. There is quite a difference here. Maybe we
need to go through the log files and compare test execution times of
individual tests between timed out and succeeded builds to get a clue.
...


Any chance we could get Jenkins to get a thread dump when aborting the VM?

Best regards, Julian


Re: DocumentRDB builds time out on Jenkins

2015-12-09 Thread Michael Dürig



On 9.12.15 12:02 , Michael Dürig wrote:




Do you have any suggestions?


Something must be way wrong as the build either takes ~30 minutes or it
times out after 90 minutes. There is quite a difference here. Maybe we
need to go through the log files and compare test execution times of
individual tests between timed out and succeeded builds to get a clue.



Comparing test execution times for builds 593 and 595. The former timed 
out and the latter passed. Both ran on Java 1.8.


Running org.apache.jackrabbit.oak.jcr.ValueFactoryTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.774 sec

vs.

Running org.apache.jackrabbit.oak.jcr.MoveRemoveTest
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.893 sec


Running org.apache.jackrabbit.oak.jcr.query.QueryTest
Tests run: 27, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 541.876 sec

vs.

Running org.apache.jackrabbit.oak.jcr.query.QueryTest
Tests run: 27, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 48.32 sec

So it seems that for some reason sometimes tests run 3 - 5 times slower 
than normal, which with a usual build time of ~30 minutes bumps us over 
the 90 minutes build time out.


Michael


See:
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/595/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/consoleFull
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/593/jdk=jdk1.8.0_11,label=Ubuntu,nsfixtures=DOCUMENT_RDB,profile=unittesting/console