Re: Core build instability

2021-09-21 Thread Tobias Gruetzmacher
Hi,

On Tue, Sep 21, 2021 at 03:14:57PM +0100, Tim Jacomb wrote:
> There was a clear, reliably reproducible failure starting in 2.309,
> which was caused by a minor increase in required resources.
> 
> But because resources on the CI system were (accidentally) so low, it
> manifested as a problem there.

Just as a data point (I haven't been able to collect enough evidence
yet): it seems that shortly after the release of 2.309, Checkstyle was
upgraded to version 9.0, which switched from ANTLR 2 to ANTLR 4.

We have seen random OOM exceptions at my company due to
maven-checkstyle-plugin apparently not releasing memory after it has run
(not thoroughly debugged yet), which leads to mvn processes being
~300-500 MB larger in some builds...

Regards, Tobias



Re: Core build instability

2021-09-21 Thread Jesse Glick
Sorry to have implied that any action was required of you; I should have
phrased this as more of a “heads-up, possible regression under
investigation here”.



Re: Core build instability

2021-09-21 Thread Basil Crow
On Tue, Sep 21, 2021 at 7:04 AM Jesse Glick  wrote:
> That was my best guess based on running `git bisect`: with the parallel class 
> loading, the docs generator failed; without it, the generator worked.

But this is just _data_; it doesn't mean anything unless we extract
the _insights_ out of it. To do that, we needed to understand _why_
the docs generator started failing.

> Sounds like the instability in core builds themselves was unrelated, a 
> coincidence?

Looks that way to me, despite the claim that jenkinsci/jenkins#5687
"seems to be the cause of recent OOMEs, […] intermittently here [in
core] (acc. to @timja)".

I understand that operational issues that cause builds/tests to fail
can be tough to track down. I am a professional operator, so I know!
But I get enough of that at the day job and am unwilling to volunteer
that type of work for this project. I am happy to fix regressions that
I have caused as a developer; all I ask is that a little more thought
be given to the root cause analysis before dragging me in.



Re: Core build instability

2021-09-21 Thread Tim Jacomb
> Sounds like the instability in core builds themselves was unrelated, a
> coincidence?

I think it was another symptom, compounded by us not having enough history
to see clearly when it started.
There was a clear, reliably reproducible failure starting in 2.309, which
was caused by a minor increase in required resources.

But because resources on the CI system were (accidentally) so low, it
manifested as a problem there.

On Tue, 21 Sept 2021 at 15:04, Jesse Glick  wrote:

> On Mon, Sep 20, 2021 at 4:24 PM Basil Crow  wrote:
>
>> I do not think it is appropriate to imply that a developer caused a
>> regression […] simply because an operational failure occurred.
>>
>
> That was my best guess based on running `git bisect`: with the parallel
> class loading, the docs generator failed; without it, the generator worked.
> As mentioned, we have only speculated on what the real cause of the OOME
> was—something triggered by parallel class loading, which does not imply a
> root cause. For example, Jenkins might simply start faster in multiple
> threads, enabled by parallel class loading, and then do something unrelated
> to class loading which allocates lots of heap too quickly for GC to keep up.
>
> Sounds like the instability in core builds themselves was unrelated, a
> coincidence?
>



Re: Core build instability

2021-09-21 Thread Jesse Glick
On Mon, Sep 20, 2021 at 4:24 PM Basil Crow  wrote:

> I do not think it is appropriate to imply that a developer caused a
> regression […] simply because an operational failure occurred.
>

That was my best guess based on running `git bisect`: with the parallel
class loading, the docs generator failed; without it, the generator worked.
As mentioned, we have only speculated on what the real cause of the OOME
was—something triggered by parallel class loading, which does not imply a
root cause. For example, Jenkins might simply start faster in multiple
threads, enabled by parallel class loading, and then do something unrelated
to class loading which allocates lots of heap too quickly for GC to keep up.

Sounds like the instability in core builds themselves was unrelated, a
coincidence?



Re: Core build instability

2021-09-20 Thread Tim Jacomb
Given this has been going on for weeks, including people looking at flaky
tests and a lot of re-running of builds, it was not clear at the time that
this was because resources had actually changed.

We were looking for the root cause and thought you might have had an
insight into it. I would definitely expect to be pinged if a bisect had
shown my commit to have caused the job to start failing (even if only to
learn why resource requirements have increased).

And it's interesting to know about the class loading lock; I was not aware
of that :)

Thanks
Tim

On Mon, 20 Sep 2021 at 21:24, Basil Crow  wrote:

> On Mon, Sep 20, 2021 at 12:57 PM Jesse Glick  wrote:
> >
> > Any notion yet of why that would be?
>
> Why do you ask? The maximum heap size seems to have been 1516 MiB in
> e.g.
> https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/299/consoleFull
> but had dropped to 954 MiB by e.g.
>
> https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/322/consoleFull
> so the problem with pipeline-steps-doc-generator seems clear to me:
> the operators mistakenly reduced the memory size of the test system,
> and the job happened to continue to work for a while until organic
> growth exposed the original operational issue. With the operational
> issue resolved, PRs like jenkins-infra/pipeline-steps-doc-generator#92
> are now passing against recent core releases. As far as I can tell,
> this was a false alarm. I should not have been pinged about this.
>
> I do not think it is appropriate to imply that a developer caused a
> regression (for example, by describing jenkinsci/jenkins#5687 as "the
> culprit") simply because an operational failure occurred. The cause of
> the operational failure should be understood, and if that cause points
> to a regression caused by a developer (such as a memory leak), then
> the developer should be notified.
>
> Anyway, one theory is that the organic increase in heap usage may be
> coming from ClassLoader#getClassLoadingLock(String). If the
> ClassLoader object is registered as parallel-capable, this method
> returns a dedicated object associated with the specified class name;
> otherwise, it returns the ClassLoader object. Perhaps there are enough
> of these dedicated objects to cause a modest increase in heap usage on
> some installations (~300 MiB in the case of
> pipeline-steps-doc-generator).
>



Re: Core build instability

2021-09-20 Thread Basil Crow
On Mon, Sep 20, 2021 at 12:57 PM Jesse Glick  wrote:
>
> Any notion yet of why that would be?

Why do you ask? The maximum heap size seems to have been 1516 MiB in
e.g. 
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/299/consoleFull
but had dropped to 954 MiB by e.g.
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/master/322/consoleFull
so the problem with pipeline-steps-doc-generator seems clear to me:
the operators mistakenly reduced the memory size of the test system,
and the job happened to continue to work for a while until organic
growth exposed the original operational issue. With the operational
issue resolved, PRs like jenkins-infra/pipeline-steps-doc-generator#92
are now passing against recent core releases. As far as I can tell,
this was a false alarm. I should not have been pinged about this.

I do not think it is appropriate to imply that a developer caused a
regression (for example, by describing jenkinsci/jenkins#5687 as "the
culprit") simply because an operational failure occurred. The cause of
the operational failure should be understood, and if that cause points
to a regression caused by a developer (such as a memory leak), then
the developer should be notified.

Anyway, one theory is that the organic increase in heap usage may be
coming from ClassLoader#getClassLoadingLock(String). If the
ClassLoader object is registered as parallel-capable, this method
returns a dedicated object associated with the specified class name;
otherwise, it returns the ClassLoader object. Perhaps there are enough
of these dedicated objects to cause a modest increase in heap usage on
some installations (~300 MiB in the case of
pipeline-steps-doc-generator).
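
To illustrate the mechanism (a minimal, self-contained sketch, not Jenkins
or Ant code; the loader and class names below are hypothetical):

    /*
     * Once a loader class calls registerAsParallelCapable() in its static
     * initializer, the default getClassLoadingLock(String) implementation
     * returns a dedicated lock object per class name instead of returning
     * the loader itself.
     */
    class ParallelLoader extends ClassLoader {
        static {
            registerAsParallelCapable(); // opt in to parallel class loading
        }

        ParallelLoader(ClassLoader parent) {
            super(parent);
        }

        // widen visibility so the demo below can inspect the lock objects
        Object lockFor(String name) {
            return getClassLoadingLock(name);
        }
    }

    class SerialLoader extends ClassLoader {
        // no registerAsParallelCapable() call: the loader itself is the lock
        SerialLoader(ClassLoader parent) {
            super(parent);
        }

        Object lockFor(String name) {
            return getClassLoadingLock(name);
        }
    }

    public class ClassLoadingLockDemo {
        public static void main(String[] args) {
            ClassLoader parent = ClassLoadingLockDemo.class.getClassLoader();
            ParallelLoader parallel = new ParallelLoader(parent);
            SerialLoader serial = new SerialLoader(parent);

            // parallel-capable: one dedicated lock object per class name
            System.out.println(parallel.lockFor("a.B") == parallel.lockFor("a.B")); // true
            System.out.println(parallel.lockFor("a.B") == parallel.lockFor("c.D")); // false
            System.out.println(parallel.lockFor("a.B") == parallel);                // false

            // not parallel-capable: every name maps to the loader itself
            System.out.println(serial.lockFor("a.B") == serial);                    // true
        }
    }

Each dedicated lock is a plain Object held in a per-loader map for the
lifetime of the loader, so the cost scales with the number of distinct
class names loaded through parallel-capable loaders.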



Re: Core build instability

2021-09-20 Thread Tim Jacomb
Pending https://github.com/jenkins-infra/jenkins-infra/pull/1872 being
approved, I've manually changed the settings, and we finally have a passing
build for stable
(in https://github.com/jenkinsci/jenkins/pull/5729).



On Mon, 20 Sept 2021 at 20:57, Jesse Glick  wrote:

> On Mon, Sep 20, 2021 at 3:37 PM Basil Crow  wrote:
>
>> I *do* see evidence that registering AntClassLoader (specifically) as
>> parallel-capable has increased the heap size requirement
>>
>
> Any notion yet of why that would be? It should be loading the same set of
> classes, just at slightly different times, unless I am missing something.
>



Re: Core build instability

2021-09-20 Thread Jesse Glick
On Mon, Sep 20, 2021 at 3:37 PM Basil Crow  wrote:

> I *do* see evidence that registering AntClassLoader (specifically) as
> parallel-capable has increased the heap size requirement
>

Any notion yet of why that would be? It should be loading the same set of
classes, just at slightly different times, unless I am missing something.



Re: Core build instability

2021-09-20 Thread Basil Crow
I see no evidence that jenkinsci/jenkins#5687 has introduced a leak,
so I do not think it should be reverted. I _do_ see evidence that
registering AntClassLoader (specifically) as parallel-capable has
increased the heap size requirement for pipeline-steps-doc-generator:
1280 MiB seems to be sufficient, while what JVM ergonomics picked for
e.g. 
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-92/1/consoleFull
(945 MiB) is insufficient. My recommendation to operators is to adjust
the hardware and/or -Xmx settings to ensure that a sufficiently large
heap is provided.
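
For anyone who wants to confirm what limit a given agent actually ends up
with, a trivial probe (hypothetical class, not part of
pipeline-steps-doc-generator) is to print Runtime.getRuntime().maxMemory(),
which reflects whatever -Xmx or JVM ergonomics settled on:

    public class MaxHeapProbe {
        public static void main(String[] args) {
            // Max heap the JVM settled on, whether from an explicit -Xmx or
            // from ergonomics (typically a fraction of the machine/container
            // memory), so a resized agent shows up immediately.
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap: %d MiB%n", maxBytes / (1024 * 1024));
        }
    }

Passing an explicit -Xmx (or -XX:MaxRAMPercentage on newer JDKs) in the
job's JVM options is one way to keep that limit from silently shifting when
agents are resized.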



Re: Core build instability

2021-09-20 Thread Tim Jacomb
Thanks for tracking down where the memory issue appears to be coming from
in
https://github.com/jenkins-infra/pipeline-steps-doc-generator/pull/94#issuecomment-923094344

I think the other issue is that the CPU count appears to have been
accidentally reverted to 2 cores =/
It was increased here:
https://github.com/jenkins-infra/jenkins-infra/commit/513092b2da8a08cc605bb32fa924fa7b1b260cac#diff-e10b8e08a0aba3a716e4500cd9e812f13511e6ea996705082c3cc6a612074b52

Haven't tracked down exactly where it was changed.

Tim


On Mon, 20 Sept 2021 at 13:39, Jesse Glick  wrote:

> So we have the JNA upgrade, XStream upgrade, and parallel class loading. I
> will try to bisect the cause.
>



Re: Core build instability

2021-09-20 Thread Jesse Glick
So we have the JNA upgrade, XStream upgrade, and parallel class loading. I
will try to bisect the cause.
